Python Web Scraping

Runtime environment: Python 3

The BeautifulSoup4 parsing library

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

BeautifulSoup4 is an HTML/XML parser; its main job is to parse HTML/XML and extract data from it.

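Python offers three basic tools for scraping static web pages: regular expressions, BeautifulSoup, and lxml. Their characteristics are roughly as follows: [Figure 1: comparison of the three methods]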

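BeautifulSoup covers much the same ground as lxml. lxml is a binding to the C libraries libxml2 and libxslt, whereas BeautifulSoup works on the HTML DOM and loads the entire document into a tree of Python objects before any query runs. In terms of performance, BeautifulSoup is therefore slower than lxml.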

Installing BeautifulSoup4:

Under Python 3, BeautifulSoup4 can be installed as follows:

pip3 install beautifulsoup4

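Using BeautifulSoup4

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, it falls back to Python's built-in parser. The lxml parser is more capable and faster, so installing it is recommended.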
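As a rough sketch of choosing a parser, the snippet below tries lxml first and falls back to the standard library's html.parser when lxml is not installed. The helper name make_soup and the sample markup are made up for illustration; only BeautifulSoup and FeatureNotFound come from bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, else with the built-in parser.

    make_soup is a hypothetical helper, not part of BeautifulSoup.
    """
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<html><body><h1>Hello</h1><p>World</p></body></html>')
print(soup.h1.get_text())  # Hello
```

Either parser gives the same result for well-formed markup like this; the two may differ in how they repair broken HTML.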


from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')

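# bs.find_all(tagName, tagAttributes) returns every matching tag on the page
nameList = bs.find_all('span', {'class': 'green'})
title = bs.body.h1
print(title)

head = bs.find_all(['h1', 'h2'])
print(head)

# The string argument (called text in older versions) is different: it matches on the text content of tags rather than on their attributes
nameList1 = bs.find_all(string='the prince')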
print(len(nameList1))

for name in nameList:
    print(name.get_text())

bs.find_all(tagName, tagAttributes) returns every tag on the page that matches.

find() and find_all() in BeautifulSoup

The BeautifulSoup documentation defines the two as follows:

  find_all(tag, attributes, recursive, text, limit, keywords)
  
  find(tag, attributes, recursive, text, keywords)
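To make the difference between the two concrete, here is a small sketch run against an inline HTML string rather than a live page. The markup and variable names are invented for the example; limit and recursive are the arguments listed in the signatures above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = """
<div id="cast">
  <span class="green">Anna</span>
  <span class="red">narrator</span>
  <span class="green">Pierre</span>
  <p><span class="green">Natasha</span></p>
</div>
"""
bs = BeautifulSoup(html, 'html.parser')

# find_all returns a list of every match; limit caps the number of results.
all_green = bs.find_all('span', {'class': 'green'})
first_two = bs.find_all('span', {'class': 'green'}, limit=2)

# find returns only the first match (or None), like find_all with limit=1.
first = bs.find('span', {'class': 'green'})

# recursive=False searches direct children only, so the nested span is skipped.
cast = bs.find('div', {'id': 'cast'})
direct = cast.find_all('span', {'class': 'green'}, recursive=False)

print([s.get_text() for s in all_green])   # ['Anna', 'Pierre', 'Natasha']
print([s.get_text() for s in first_two])   # ['Anna', 'Pierre']
print(first.get_text())                    # Anna
print([s.get_text() for s in direct])      # ['Anna', 'Pierre']
```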
  

Regular expressions and BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img',
                     {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])
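The same attribute filter can be tried offline. In this sketch the HTML string is invented to mimic the page structure; it shows that a compiled pattern passed as an attribute value keeps only the tags whose src matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the pattern targets.
html = """
<img src="../img/gifts/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
<img src="../img/other/img3.jpg">
"""
bs = BeautifulSoup(html, 'html.parser')

# A raw string keeps the backslashes intact for the regex engine.
pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')
product_images = bs.find_all('img', {'src': pattern})

sources = [img['src'] for img in product_images]
print(sources)  # ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
```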

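Writing a web crawler

A common way to crawl a site exhaustively is to start from a top-level page (such as the home page) and collect all the internal links on it into a list. Then fetch each page those links point to, collect the links found on each of them into a new list, and run the next round of crawling.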
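The round-by-round procedure can be sketched without any network access. In this illustration the SITE dictionary is a made-up stand-in for fetching a page and extracting its internal links, and crawl is a hypothetical helper name.

```python
from collections import deque

# A made-up site: each page maps to the internal links found on it.
SITE = {
    '/': ['/about', '/products'],
    '/about': ['/', '/team'],
    '/products': ['/products/1', '/products/2', '/'],
    '/team': ['/about'],
    '/products/1': ['/products'],
    '/products/2': ['/products', '/team'],
}


def crawl(start):
    """Visit every page reachable from start, one round of links at a time."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in SITE.get(page, []):   # stand-in for fetching and parsing
            if link not in seen:          # skip pages already scheduled
                seen.add(link)
                queue.append(link)
    return order


print(crawl('/'))
# ['/', '/about', '/products', '/team', '/products/1', '/products/2']
```

Keeping a set of pages already seen is what stops the crawl from looping forever on pages that link back to each other.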

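1. Finding every link to another article in the Wikipedia entry for Kevin Bacon

  • A function getLinks that takes a Wikipedia article URL of the form /wiki/<article name> and returns a list of all linked article URLs in the same form.

  • A main routine that calls getLinks with a starting article, picks a random article link from the returned list, and calls getLinks again, until you stop the program or the new page contains no article links.

    The complete code is as follows: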

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

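# random.seed() no longer accepts a datetime object (Python 3.11+), so pass its timestamp
random.seed(datetime.datetime.now().timestamp())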

def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id': 'bodyContent'}).find_all('a',
                                                          href=re.compile('^(/wiki/)((?!:).)*$'))
links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)
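The href filter used in getLinks can be checked on its own. A minimal sketch, with made-up sample hrefs: the pattern accepts paths that begin with /wiki/ and contain no colon, which is how namespace pages such as Category: or File: are excluded.

```python
import re

# Same pattern as in getLinks: starts with /wiki/ and has no ':' anywhere after.
article_link = re.compile('^(/wiki/)((?!:).)*$')

# Sample href values of the kinds found on a Wikipedia page (illustrative).
hrefs = [
    '/wiki/Kevin_Bacon',
    '/wiki/Category:American_actors',
    '/wiki/Footloose_(1984_film)',
    '//wikimediafoundation.org/',
    '#cite_note-1',
    '/wiki/File:Kevin_Bacon.jpg',
]

articles = [h for h in hrefs if article_link.match(h)]
print(articles)  # ['/wiki/Kevin_Bacon', '/wiki/Footloose_(1984_film)']
```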

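2. Collecting data across the site

Looking at a few Wikipedia pages, both article pages and non-article pages such as the privacy policy, yields the following rules.

  • All titles (on every page, whether an article page, an edit-history page, or anything else) sit under h1 → span, and there is only one h1 tag per page.

  • As mentioned earlier, all body text lives inside the div#bodyContent tag. To get just the first paragraph, however, div#mw-content-text → p may be better (selecting only the first paragraph tag). This rule holds for all content pages except file pages (for example, https://en.wikipedia.org/wiki/File:Orbit_of_274301_Wikipedia.svg), which have no content-text section.

  • Edit links appear only on article pages. When present, they are found under li#ca-edit → span → a.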

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span')
              .find('a').attrs['href'])
    except AttributeError:
        print('This page is missing some attributes! No need to worry, though.')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:  # we have found a new page
                newPage = link.attrs['href']
                print('-' * 20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks('')

Scraping the URLs of ChakraCore issues labeled Bug:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    # pageUrl is a suffix appended to the URL of the Bug label page
    global pages
    html = urlopen('https://github.com/chakra-core/ChakraCore/labels/Bug{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile(r'^/chakra-core/ChakraCore/issues/[0-9]+')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:  # we have found a new issue
                newPage = link.attrs['href']
                print('-' * 20)
                print(newPage)
                # Only record the issue URL: appending it to the label URL would not form a valid address, so there is no recursive call here
                pages.add(newPage)

getLinks('')

Scrapy

1. Installing Scrapy:

 conda install -c conda-forge scrapy

  • A spider is a Scrapy project that, true to its name, crawls the web (scrapes pages).

  • A "crawler" means "any program that scrapes web pages, with or without Scrapy".

https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

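2. Writing your first spider

A Spider is a class you write to scrape data from a single website (or a group of websites).

It contains an initial list of URLs to download, along with the logic for following links in the pages and for parsing page content to extract items.

To create a Spider, subclass scrapy.Spider and define the following three attributes:

  • name: identifies the Spider. It must be unique; you cannot give the same name to different Spiders.
  • start_urls: the list of URLs the Spider starts crawling from. The first pages fetched will therefore be among these; subsequent URLs are extracted from the data retrieved from the initial URLs.
  • parse(): a method of the spider. When called, it receives the Response object produced for each downloaded start URL as its only argument. It is responsible for parsing the response data, extracting data (producing items), and generating Request objects for further URLs to process.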

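Creating a project

Before you start scraping, you must create a new Scrapy project. Change into the directory where you want to keep your code and run:

scrapy startproject tutorial

This command creates a tutorial directory with the following contents: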

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

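These files are:

  • scrapy.cfg: the project configuration file.
  • tutorial/: the project's Python module; you will add your code here later.
  • tutorial/items.py: the project's items file.
  • tutorial/pipelines.py: the project's pipelines file.
  • tutorial/settings.py: the project's settings file.
  • tutorial/spiders/: the directory where spider code goes.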

Defining an Item

An Item is a container for scraped data. It is used much like a Python dict and adds protection against undefined-field errors caused by typos.

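Extracting Items

Introduction to Selectors

There are many ways to extract data from web pages. Scrapy uses a mechanism based on XPath and CSS expressions, called Scrapy Selectors. For more on selectors and other extraction mechanisms, see the Selector documentation.

Here are some example XPath expressions and what they mean: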

  • /html/head/title: 选择HTML文档中 标签内的 </code> 元素</li> <li><code>/html/head/title/text()</code>: 选择上面提到的 <code><title></code> 元素的文字</li> <li><code>//td</code>: 选择所有的 <code><td></code> 元素</li> <li><code>//div[@class="mine"]</code>: 选择所有具有 <code>class="mine"</code> 属性的 <code>div</code> 元素</li> </ul> <hr> <p>为了配合XPath,Scrapy除了提供了 <code>Selector</code> 之外,还提供了方法来避免每次从response中提取数据时生成selector的麻烦。</p> <p>Selector有四个基本的方法(点击相应的方法可以看到详细的API文档):</p> <ul> <li><code>xpath()</code>: 传入xpath表达式,返回该表达式所对应的所有节点的selector list列表 。</li> <li><code>css()</code>: 传入CSS表达式,返回该表达式所对应的所有节点的selector list列表.</li> <li><code>extract()</code>: 序列化该节点为unicode字符串并返回list。</li> <li><code>re()</code>: 根据传入的正则表达式对数据进行提取,返回unicode字符串list列表。</li> </ul> <p>在查看了网页的源码后,您会发现网站的信息是被包含在 <em>第二个</em> <code><ul></code> 元素中。</p> <p>我们可以通过这段代码选择该页面中网站列表里所有 <code><li></code> 元素:</p> <pre><code>response.xpath('//ul/li') </code></pre> <p>网站的描述:</p> <pre><code>response.xpath('//ul/li/text()').extract() </code></pre> <p>网站的标题:</p> <pre><code>response.xpath('//ul/li/a/text()').extract() </code></pre> <p>以及网站的链接:</p> <pre><code>response.xpath('//ul/li/a/@href').extract() </code></pre> <p>之前提到过,每个 <code>.xpath()</code> 调用返回selector组成的list,因此我们可以拼接更多的 <code>.xpath()</code> 来进一步获取某个节点。我们将在下边使用这样的特性:</p> <pre><code class="prism language-python"><span class="token keyword">for</span> sel <span class="token keyword">in</span> response<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'//ul/li'</span><span class="token punctuation">)</span><span class="token punctuation">:</span> title <span class="token operator">=</span> sel<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'a/text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span> link <span class="token operator">=</span> 
sel<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'a/@href'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span> desc <span class="token operator">=</span> sel<span class="token punctuation">.</span>xpath<span class="token punctuation">(</span><span class="token string">'text()'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>extract<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">print</span> title<span class="token punctuation">,</span> link<span class="token punctuation">,</span> desc </code></pre> <h1>mysql数据库</h1> <h3>1.启动:</h3> <pre><code class="prism language-sql">mysql <span class="token operator">-</span>u root </code></pre> <p>密码为:12345678</p> <h3>2.<strong>显示所有数据库</strong></h3> <p>输入show databases;命令,显示所有数据库</p> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> show databases<span class="token punctuation">;</span> </code></pre> <h3>3.创建数据库:</h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> create database studb<span class="token punctuation">;</span> </code></pre> <h3><strong>4. 使用数据库</strong></h3> <p>在上面显示的数据库中,实例中使用studb数据库,输入下面命令:</p> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> use studb<span class="token punctuation">;</span> </code></pre> <h3>5.创建表</h3> <pre><code class="prism language-mysql">mysql> create table test -> ( -> sid varchar(20) not null primary key, -> sname varchar(20) not null, -> sddress varchar(40) -> ); </code></pre> <h3><strong>6. 
打印表结构</strong></h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> desc t_stu<span class="token punctuation">;</span> </code></pre> <p>打印结果:</p> <pre><code class="prism language-javascript"><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span> <span class="token operator">|</span> Field <span class="token operator">|</span> Type <span class="token operator">|</span> Null <span class="token operator">|</span> Key <span class="token operator">|</span> Default <span class="token operator">|</span> Extra <span class="token operator">|</span> <span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token 
operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span> <span class="token operator">|</span> sid <span class="token operator">|</span> <span class="token function">varchar</span><span class="token punctuation">(</span><span class="token number">20</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token constant">NO</span> <span class="token operator">|</span> <span class="token constant">PRI</span> <span class="token operator">|</span> <span class="token constant">NULL</span> <span class="token operator">|</span> <span class="token operator">|</span> <span class="token operator">|</span> sname <span class="token operator">|</span> <span class="token function">varchar</span><span class="token punctuation">(</span><span class="token number">20</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token constant">NO</span> <span class="token 
operator">|</span> <span class="token operator">|</span> <span class="token constant">NULL</span> <span class="token operator">|</span> <span class="token operator">|</span> <span class="token operator">|</span> address <span class="token operator">|</span> <span class="token function">varchar</span><span class="token punctuation">(</span><span class="token number">50</span><span class="token punctuation">)</span> <span class="token operator">|</span> <span class="token constant">YES</span> <span class="token operator">|</span> <span class="token operator">|</span> <span class="token constant">NULL</span> <span class="token operator">|</span> <span class="token operator">|</span> <span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span> <span class="token 
number">3</span> rows <span class="token keyword">in</span> <span class="token function">set</span> <span class="token punctuation">(</span><span class="token number">0.02</span> sec<span class="token punctuation">)</span> </code></pre> <h3><strong>7. 表中增加数据</strong></h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> insert into t_stu <span class="token operator">-</span><span class="token operator">></span> select <span class="token string">'s001'</span> <span class="token punctuation">,</span> <span class="token string">'jin'</span> <span class="token punctuation">,</span> <span class="token string">'changzhou'</span> <span class="token operator">-</span><span class="token operator">></span> union <span class="token operator">-</span><span class="token operator">></span> select <span class="token string">'s002'</span> <span class="token punctuation">,</span> <span class="token string">'tom'</span> <span class="token punctuation">,</span> <span class="token string">'yangzhou'</span> <span class="token operator">-</span><span class="token operator">></span> union <span class="token operator">-</span><span class="token operator">></span> select <span class="token string">'s003'</span> <span class="token punctuation">,</span> <span class="token string">'kate'</span> <span class="token punctuation">,</span> <span class="token string">'suzhou'</span> <span class="token operator">-</span><span class="token operator">></span> <span class="token punctuation">;</span> </code></pre> <h3><strong>8. 
查看表数据</strong></h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> select <span class="token operator">*</span> <span class="token keyword">from</span> t_stu<span class="token punctuation">;</span> </code></pre> <p>查看结果:</p> <pre><code class="prism language-javascript"><span class="token operator">|</span> sid <span class="token operator">|</span> sname <span class="token operator">|</span> address <span class="token operator">|</span> <span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span> <span class="token operator">|</span> s001 <span class="token operator">|</span> jin <span class="token operator">|</span> wuhan <span class="token operator">|</span> <span class="token operator">|</span> s002 <span class="token operator">|</span> tom <span class="token operator">|</span> shanghai <span class="token operator">|</span> <span class="token operator">|</span> s003 <span class="token operator">|</span> kate <span class="token operator">|</span> suzhou <span class="token operator">|</span> <span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">-</span><span class="token operator">+</span><span class="token 
operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">--</span><span class="token operator">+</span> <span class="token number">3</span> rows <span class="token keyword">in</span> <span class="token function">set</span> <span class="token punctuation">(</span><span class="token number">0.01</span> sec<span class="token punctuation">)</span> </code></pre> <h3><strong>9. 修改表中数据</strong></h3> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> update t_stu <span class="token keyword">set</span> sname <span class="token operator">=</span> <span class="token string">"fby"</span> where sid <span class="token operator">=</span> <span class="token string">"s001"</span><span class="token punctuation">;</span> </code></pre> <h3><strong>10. 删除表中数据</strong></h3> <p>删除表中sid = “s002”的数据</p> <pre><code class="prism language-javascript">mysql<span class="token operator">></span> <span class="token keyword">delete</span> <span class="token keyword">from</span> t_stu where sid <span class="token operator">=</span> <span class="token string">"s002"</span><span class="token punctuation">;</span> </code></pre> <h1>读csv文件</h1> <pre><code class="prism language-python"><span class="token keyword">from</span> urllib<span class="token punctuation">.</span>request <span class="token keyword">import</span> urlopen <span class="token keyword">from</span> io <span class="token keyword">import</span> StringIO <span class="token keyword">import</span> csv data <span class="token operator">=</span> urlopen<span class="token punctuation">(</span><span class="token string">'http://pythonscraping.com/files/MontyPythonAlbums.csv'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span>decode<span class="token 
punctuation">(</span><span class="token string">'ascii'</span><span class="token punctuation">,</span> <span class="token string">'ignore'</span><span class="token punctuation">)</span> dataFile <span class="token operator">=</span> StringIO<span class="token punctuation">(</span>data<span class="token punctuation">)</span> csvReader <span class="token operator">=</span> csv<span class="token punctuation">.</span>reader<span class="token punctuation">(</span>dataFile<span class="token punctuation">)</span> <span class="token keyword">for</span> row <span class="token keyword">in</span> csvReader<span class="token punctuation">:</span> <span class="token keyword">print</span><span class="token punctuation">(</span>row<span class="token punctuation">)</span> </code></pre> <h1>Python使用pandas处理CSV文件</h1> <p>https://blog.csdn.net/atnanyang/article/details/70832257</p> <p>Python中有许多方便的库可以用来进行数据处理,尤其是Numpy和Pandas,再搭配matplot画图专用模块,功能十分强大。</p> <p>CSV(Comma-Separated Values)格式的文件是<strong>指以纯文本形式存储的表格数据</strong>,这意味着不能简单的使用Excel表格工具进行处理,而且Excel表格处理的数据量十分有限,而<strong>使用Pandas来处理数据量巨大的CSV文件</strong>就容易的多了。</p> <ul> <li><strong>Pandas读取本地CSV文件并设置Dataframe(数据格式)</strong></li> </ul> <pre><code class="prism language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd <span class="token keyword">import</span> numpy <span class="token keyword">as</span> np df<span class="token operator">=</span>pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'filename'</span><span class="token punctuation">,</span>header<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span>sep<span class="token operator">=</span><span class="token string">' '</span><span class="token punctuation">)</span> <span class="token comment">#filename可以直接从盘符开始,标明每一级的文件夹直到csv文件,header=None表示头部为空,sep=' '表示数据间使用空格作为分隔符,如果分隔符是逗号,只需换成 ‘,’即可。</span> <span 
class="token keyword">print</span> df<span class="token punctuation">.</span>head<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token keyword">print</span> df<span class="token punctuation">.</span>tail<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment">#作为示例,输出CSV文件的前5行和最后5行,这是pandas默认的输出5行,可以根据需要自己设定输出几行的值</span> </code></pre> <ul> <li><strong>使用pandas直接读取本地的csv文件后,csv文件的列索引默认为从0开始的数字,重定义列索引的语句如下:</strong></li> </ul> <pre><code class="prism language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd <span class="token keyword">import</span> numpy <span class="token keyword">as</span> np df<span class="token operator">=</span>pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'filename'</span><span class="token punctuation">,</span>header<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">,</span>sep<span class="token operator">=</span><span class="token string">' '</span><span class="token punctuation">,</span>names<span class="token operator">=</span><span class="token punctuation">[</span><span class="token string">"week"</span><span class="token punctuation">,</span><span class="token string">'month'</span><span class="token punctuation">,</span><span class="token string">'date'</span><span class="token punctuation">,</span><span class="token string">'time'</span><span class="token punctuation">,</span><span class="token string">'year'</span><span class="token punctuation">,</span><span class="token string">'name1'</span><span class="token punctuation">,</span><span class="token string">'freq1'</span><span class="token punctuation">,</span><span class="token string">'name2'</span><span class="token punctuation">,</span><span class="token string">'freq2'</span><span class="token 
punctuation">,</span><span class="token string">'name3'</span><span class="token punctuation">,</span><span class="token string">'data1'</span><span class="token punctuation">,</span><span class="token string">'name4'</span><span class="token punctuation">,</span><span class="token string">'data2'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token keyword">print</span> df </code></pre> <h2>使用pandas按列合并CSV文件</h2> <p>1.列合并两个csv文件</p> <pre><code class="prism language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd df1 <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'dataset/easy29.csv'</span><span class="token punctuation">)</span> df2 <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'dataset/easy210.csv'</span><span class="token punctuation">)</span> frames <span class="token operator">=</span> <span class="token punctuation">[</span>df1<span class="token punctuation">,</span> df2<span class="token punctuation">]</span> all_csv <span class="token operator">=</span> pd<span class="token punctuation">.</span>concat<span class="token punctuation">(</span>frames<span class="token punctuation">)</span> </code></pre> <p><a href="http://img.e-com-net.com/image/info8/d0cf5b8dc58a40fab1783fe6d2cfab9c.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/d0cf5b8dc58a40fab1783fe6d2cfab9c.jpg" alt="python网络爬虫_第3张图片" width="430" height="272" style="border:1px solid black;"></a></p> <p>2.通过追加的方式合并csv文件。</p> <pre><code class="prism language-python"><span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">'1.csv'</span><span class="token punctuation">,</span><span 
class="token string">'ab'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span> f<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">'2.csv'</span><span class="token punctuation">,</span><span class="token string">'rb'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>read<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token comment">#将2.csv内容追加到1.csv的后面</span> </code></pre> <p>3.在将多个csv文件拼接到一起的时候,可以用Python通过pandas包的read_csv和to_csv两个方法来完成。这里不采用pandas.merge()来进行csv的拼接,而只是通过简单的文件的读取和附加方式的写入来完成拼接。</p> <p>3.1</p> <pre><code class="prism language-python"><span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd <span class="token keyword">for</span> inputfile <span class="token keyword">in</span> os<span class="token punctuation">.</span>listdir<span class="token punctuation">(</span>inputfile_dir<span class="token punctuation">)</span><span class="token punctuation">:</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span>inputfile<span class="token punctuation">,</span> header<span class="token operator">=</span><span class="token boolean">None</span><span class="token punctuation">)</span>                    <span class="token comment">#header=None表示原始文件数据没有列索引,这样的话read_csv会自动加上列索引</span> pd<span class="token punctuation">.</span>to_csv<span class="token punctuation">(</span>outputfile<span class="token punctuation">,</span> mode<span class="token operator">=</span><span class="token string">'a'</span><span class="token punctuation">,</span> index<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">,</span> 
header<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>      <span class="token comment"># header=False omits the column names, index=False omits the row index, and mode='a' appends, so the existing file content is not erased</span> </code></pre> <p>3.2</p>
<pre><code class="prism language-python">import os
import pandas as pd

# Directory holding the CSV files to merge, and the merged output file
csv_dir = r'E:\jupyternotebook_space\yimiaodatas'
out_path = r'E:\jupyternotebook_space\Alldatas.csv'

# Store the names of all files in the directory in a list
csv_name_list = os.listdir(csv_dir)
# Length of the list
length = len(csv_name_list)

# Read the first CSV file and save it with its header row; the other files are appended below it
df = pd.read_csv(os.path.join(csv_dir, csv_name_list[0]), encoding="utf-8")
df.to_csv(out_path, index=False)

# Loop over the remaining CSV files and append each one without its header
# (start at 1 so the first file is not written twice)
for i in range(1, length):
    df = pd.read_csv(os.path.join(csv_dir, csv_name_list[i]), encoding="utf-8")
    df.to_csv(out_path, index=False, header=False, mode='a')
</code></pre>
<h2>Adding an auto-increment column at the far left of a pandas DataFrame</h2>
<p>Given the table below, add a new leftmost column named “序号” (serial number), numbered from 1.</p>
<p><a href="http://img.e-com-net.com/image/info8/ce467e8caf9b4bc5bd1bd1ea4d3fefea.png" target="_blank"><img src="http://img.e-com-net.com/image/info8/ce467e8caf9b4bc5bd1bd1ea4d3fefea.png" alt="Python web crawler, image 4" width="382" height="370" style="border:1px solid black;"></a></p>
<p>The code is as follows:</p>
<pre><code class="prism language-python"># Open the file
import pandas as pd
df = pd.read_excel(r'test.xlsx')
# The serial-number column counts up from 1; a new column is added at the far right of the DataFrame by default
df['序号'] = range(1, len(df) + 1)
# Reorder the columns so that the new column comes first
df = df[['序号', 'seats', 'price', 'price-sign']]
# Write the result
df.to_excel('test_new.xlsx', index=False)
</code></pre>
<p><a href="http://img.e-com-net.com/image/info8/381bd055cd514f90b68d5f2328a8e72a.png" target="_blank"><img src="http://img.e-com-net.com/image/info8/381bd055cd514f90b68d5f2328a8e72a.png" alt="Python web crawler, image 5" width="480" height="372" style="border:1px solid black;"></a></p>
<h1>Crawling the issues of GitHub projects</h1>
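<p>The crawler in this section walks a repository's issue list page by page and, like the original code, assumes 25 issues per page. As a minimal sketch of just that page arithmetic, the helpers below use integer ceiling division to work out how many pages to request and to build the page URLs. The names <code>page_count</code> and <code>issue_page_urls</code> are hypothetical, chosen only for illustration, and the page size GitHub actually serves may differ from 25.</p>

```python
def page_count(total_items, per_page=25):
    """Return the number of pages needed to list total_items entries."""
    # Ceiling division with integers only: -(-a // b) rounds a / b upward.
    return -(-total_items // per_page)


def issue_page_urls(repo_path, total_items, per_page=25):
    """Build the issue-list URL for every page of a repository.

    repo_path is expected in the form '/owner/name', e.g. '/combust/mleap'.
    The 25-per-page default is an assumption taken from the original crawler.
    """
    base = 'https://github.com' + repo_path + '/issues?page='
    last_page = page_count(total_items, per_page)
    return [base + str(page) for page in range(1, last_page + 1)]


if __name__ == '__main__':
    # 30 issues at 25 per page need two pages
    for url in issue_page_urls('/combust/mleap', 30):
        print(url)
```

<p>Keeping the arithmetic in one small function avoids the separate "divisible by 25" and "not divisible by 25" branches, and a repository with zero issues simply yields an empty list of pages.</p>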
<h5>Usage of etree.HTML() and etree.tostring() in lxml</h5> <p>https://blog.csdn.net/qq_38410428/article/details/82792730</p> <ul> <li>etree.HTML(): parses HTML text into an element tree that can be queried with XPath, and repairs the markup automatically (for example, it adds missing opening or closing nodes).</li> <li>etree.tostring(): outputs the repaired result; its type is bytes.</li> </ul>
<pre><code class="prism language-python">from lxml import etree
import requests


# Get the list of repositories that match a keyword
def get_repos_list(key_words):
    # Initialize the list
    repos_list = []
    # By default, walk search result pages 1 to 99
    for i in range(1, 100):
        url = 'https://github.com/search?p=' + str(i) + '&amp;q=' + key_words + '&amp;type=repositories'
        response = requests.get(url)
        # Get the page source
        page_source = response.text
        # print(page_source)
        # etree.HTML() parses the text into an XPath-queryable tree and repairs the HTML
        tree = etree.HTML(page_source)
        # Get the repository hyperlinks
        arr = tree.xpath('//*[@class="f4 text-normal"]/a/@href')
        repos_list += arr
    return repos_list


# Get the list of issues of one repository
def get_issues_list(repo_name):
    issues_list = []
    url = 'https://github.com' + repo_name + '/issues'
    # print(url)
    response = requests.get(url)
    # Get the page source
    page_source = response.text
    tree = etree.HTML(page_source)
    # Get the number of issues
    number = tree.xpath('//*[@id="js-repo-pjax-container"]/div[1]/nav/ul/li[2]/a/span[2]')
    if len(number) == 0:
        number = '0'
    else:
        # .text is None for an empty element, so fall back to '0'
        number = (number[0].text or '0').strip()
    # If the counter is not a plain number (more than 1K issues), crawl 1000 of them, which is enough
    if number.isdigit():
        number = int(number)
    else:
        number = 1000
    print(number)
    # Compute the number of pages at 25 issues per page, rounding up
    page = (number + 24) // 25
    for i in range(1, page + 1):
        url = 'https://github.com' + repo_name + '/issues?page=' + str(i)
        response = requests.get(url)
        # Get the page source
        page_source = response.text
        tree = etree.HTML(page_source)
        # Get the issue hyperlinks, e.g. /combust/mleap/issues/716
        arr = tree.xpath('//*[@class="d-block d-md-none position-absolute top-0 bottom-0 left-0 right-0"]/@href')
        issues_list += arr
    # Return the issue count and the list
    return number, issues_list


# Get the content of one issue (the text of the first table cell on the issue page)
def get_issue_content(issue_name):
    # Build the issue URL
    url = 'https://github.com' + issue_name
    # print(url)
    response = requests.get(url)
    page_source = response.text
    tree = etree.HTML(page_source)
    # Get the issue content; return an empty string if the page has no table cell
    cells = tree.xpath('//table//td')
    if not cells:
        return ''
    issue_content = cells[0].xpath('string(.)')
    return issue_content


if __name__ == '__main__':
    # Tests
    # get_repos_list('ML pipeline')
    # get_issues_list('/combust/mleap')
    # get_issue_content('/combust/mleap/issues/716')
    '''
    issue="/rust-lang/rust/issues/76833"
    content=get_issue_content(issue)
    print(content)
    '''
    with open(r'result.md', 'w+', encoding='utf-8') as f:
        key_words = input('please input a keyword:')
        # Get the repository list
        repos_list = get_repos_list(key_words)
        # Format: /combust/mleap
        for repo in repos_list:
            # Build the repository URL
            repos_url = 'https://github.com' + repo
            print(repos_url)
            f.write('\n\n')
            f.write(repos_url)
            f.write('\n')
            # Get the repository's issue list
            number, issues_list = get_issues_list(repo)
            f.write(str(number))
            f.write('\n')
            # Format: /combust/mleap/issues/716
            for issue in issues_list:
                # Get the content of the issue
                issue_url = 'https://github.com' + issue
                content = get_issue_content(issue)
                # content=filter_emoji(content)
                print(issue_url)
                f.write(issue_url)
                f.write('\n')
                f.write('&gt;' * 100)
                f.write('\n')
                f.write(str(content).strip())
                f.write('\n')
                f.write('&lt;' * 100)
                f.write('\n')
                f.flush()
                # print(content)
                # print(issue)
    print('The end!')
</code></pre> <hr> <h1>Crawling commit information</h1> <h3>Getting the URL of each commit page</h3>
<pre><code class="prism language-python">import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib import request
import time
import os
from urllib.parse import urlparse

'''
This script collects the URL of every commits page.
Next step: crawl the commit history inside each page, including each commit_url, its time, and so on.
'''


# Request helper
def get_html(url):
    req = request.Request(url)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    return html


def get_sha(user, repo_name):
    # Each repo's commit history starts from one commit SHA
    url = "https://github.com/{user}/{repo_name}/commits/master".format(user=user, repo_name=repo_name)
    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    link = bs.find('a', href=re.compile("https://github.com/.*commit/(.*?)"))
    commit_url = link.attrs['href']
    # print(type(commit_url))  &lt;class 'str'&gt;
    # print(commit_url)
    # req = urlparse(commit_url)
    # print(req)
    list_commit = commit_url.split('/')
    # print(list_commit[6])  (the element at index 6 is the hash)
    commit_sha = list_commit[6]
    # print(commit_sha)
    return commit_sha


def single_repo_commits(user, repo_name):
    num = 0
    page_flag = 66  # initial page flag, meant for detecting the last page
    page_num = 0
    data_num = 0
    commit_sha = get_sha(user, repo_name)
    all_date = []  # stores the time data
    url_data = []  # stores the URL of each page
    while (page_flag and page_num &lt; 5):  # test with the first five pages
        url = "https://github.com/{user}/{repo_name}/commits/master?after={commit_sha}+{num}&amp;branch=master".format(
            user=user, repo_name=repo_name, commit_sha=commit_sha, num=num)  # build the link
        html = get_html(url)  # get the page content
        url_data.append(url)  # URL of this page; commit_url and commit time are searched for inside it next
        time_data = re.findall(r'&lt;relative-time datetime=(.*)&lt;/relative-time&gt;', html)  # match the time elements with re
        # page_flag = len(time_data)
        page_num = page_num + 1
        num = num + 35  # go to the next page
        data_num = data_num + len(time_data)
        print("page %d is ok\n get %d date" % (page_num, len(time_data)))
        # print(time_data[0]) shows the full text of the first time_data element
        for date in time_data:
            all_date.append(date[1:20])  # characters 1:20 are the date; what follows are other attributes
        time.sleep(1)  # wait a little between requests, in seconds
    print("the repo &lt;%s&gt; totally get %d commits'date" % (repo_name, data_num))
    print(url_data)
    return all_date


user = 'chakra-core'
repo_name = 'ChakraCore'
# get_sha(user, repo_name)
all_data = single_repo_commits(user, repo_name)
print(all_data)
</code></pre> <h3>get_data: collecting every commit_url on a given page</h3>
<pre><code class="prism language-python">from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

<span class="token triple-quoted-string string">'''
The get_data function collects every commit_url on the given page.
Next step: work out how to extract what was committed (title, issue, and so on) and whether to store it in Excel.
'''</span> <span class="token keyword">def</span> <span class="token function">get_data</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span> headers <span class="token operator">=</span> <span class="token punctuation">{</span> <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'</span><span class="token punctuation">}</span> html <span class="token operator">=</span> urlopen<span class="token punctuation">(</span>url<span class="token punctuation">)</span> baseurl <span class="token operator">=</span> <span class="token string">'https://github.com'</span> bs <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>html<span class="token punctuation">,</span> <span class="token string">'html.parser'</span><span class="token punctuation">)</span> pages <span class="token operator">=</span> <span class="token builtin">set</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment"># 不存在重复</span> <span class="token comment"># print(bs.contents)</span> commit_url <span class="token operator">=</span> bs<span class="token punctuation">.</span>find_all<span class="token punctuation">(</span><span class="token string">'a'</span><span class="token punctuation">,</span> href<span class="token operator">=</span>re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">'^(/chakra-core/ChakraCore/commit/).*$'</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token comment"># print(commit_url)</span> fp<span class="token operator">=</span><span class="token 
builtin">open</span><span class="token punctuation">(</span><span class="token string">'commit_url.txt'</span><span class="token punctuation">,</span> <span class="token string">'w+'</span><span class="token punctuation">)</span> <span class="token keyword">for</span> link <span class="token keyword">in</span> commit_url<span class="token punctuation">:</span> <span class="token keyword">if</span> <span class="token string">'href'</span> <span class="token keyword">in</span> link<span class="token punctuation">.</span>attrs<span class="token punctuation">:</span> <span class="token keyword">if</span> link<span class="token punctuation">.</span>attrs<span class="token punctuation">[</span><span class="token string">'href'</span><span class="token punctuation">]</span> <span class="token keyword">not</span> <span class="token keyword">in</span> pages<span class="token punctuation">:</span> <span class="token comment"># 我们遇到了新页面</span> newPage <span class="token operator">=</span> link<span class="token punctuation">.</span>attrs<span class="token punctuation">[</span><span class="token string">'href'</span><span class="token punctuation">]</span> pages<span class="token punctuation">.</span>add<span class="token punctuation">(</span>newPage<span class="token punctuation">)</span> fp<span class="token punctuation">.</span>write<span class="token punctuation">(</span>newPage<span class="token punctuation">)</span> <span class="token comment"># 将字符串写入文件中</span> fp<span class="token punctuation">.</span>write<span class="token punctuation">(</span><span class="token string">"\n"</span><span class="token punctuation">)</span> <span class="token comment"># 换行</span> <span class="token keyword">print</span><span class="token punctuation">(</span>newPage<span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>pages<span class="token 
punctuation">)</span><span class="token punctuation">)</span>
    fp<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>

get_data<span class="token punctuation">(</span><span class="token string">'https://github.com/chakra-core/ChakraCore/commits/master'</span><span class="token punctuation">)</span>
</code></pre>
<h1>Reading a file</h1>
<ol>
<li>Read the data in members.txt into dictionaries, with the name and the age as fields. The file is tab-separated ('\t'), and its first row holds the column names that become the dictionary keys.</li>
</ol>
<pre><code class="prism language-python">content <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
<span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">'members.txt'</span><span class="token punctuation">,</span> <span class="token string">'r'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
    <span class="token keyword">for</span> line <span class="token keyword">in</span> f<span class="token punctuation">.</span>readlines<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        line_list <span class="token operator">=</span> line<span class="token punctuation">.</span>strip<span class="token punctuation">(</span><span class="token string">'\n'</span><span class="token punctuation">)</span><span class="token punctuation">.</span>split<span class="token punctuation">(</span><span class="token string">'\t'</span><span class="token punctuation">)</span>  <span class="token comment"># strip the newline, then split on tabs</span>
        content<span class="token punctuation">.</span>append<span class="token punctuation">(</span>line_list<span class="token punctuation">)</span>
keys <span class="token operator">=</span> content<span class="token punctuation">[</span><span class="token number">0</span><span class="token 
punctuation">]</span>
<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token builtin">len</span><span class="token punctuation">(</span>content<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    content_dict <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
    <span class="token keyword">for</span> k<span class="token punctuation">,</span> v <span class="token keyword">in</span> <span class="token builtin">zip</span><span class="token punctuation">(</span>keys<span class="token punctuation">,</span> content<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        content_dict<span class="token punctuation">[</span>k<span class="token punctuation">]</span> <span class="token operator">=</span> v
    <span class="token keyword">print</span><span class="token punctuation">(</span>content_dict<span class="token punctuation">)</span>
<span class="token triple-quoted-string string">'''
result:
{'Name': 'Andy', 'age': '32'}
{'Name': 'Bob', 'age': '20'}
{'Name': 'Jenny', 'age': '43'}
{'Name': 'Holly', 'age': '48'}
{'Name': 'Danie', 'age': '27'}
'''</span>
</code></pre>
<h1>What the functions do</h1>
<h3>etree.HTML(), etree.tostring()</h3>
<pre><code class="prism language-python"><span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree
<span class="token keyword">import</span> requests
url <span class="token operator">=</span> <span class="token string">'https://github.com/chakra-core/ChakraCore/issues'</span>
response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token 
punctuation">(</span>url<span class="token punctuation">)</span>
<span class="token comment"># get the page source</span>
page_source <span class="token operator">=</span> response<span class="token punctuation">.</span>text
<span class="token comment"># print(page_source)</span>
tree <span class="token operator">=</span> etree<span class="token punctuation">.</span>HTML<span class="token punctuation">(</span>page_source<span class="token punctuation">)</span>
result<span class="token operator">=</span>etree<span class="token punctuation">.</span>tostring<span class="token punctuation">(</span>tree<span class="token punctuation">)</span>
<span class="token comment"># etree.HTML(): builds an element tree that can be queried with XPath and automatically repairs the HTML text.</span>
<span class="token comment"># etree.tostring(): outputs the repaired result; its type is bytes</span>
</code></pre>
<h1>Luffycity (路飞学城) crawler course notes</h1>
<h3>Chapter 1: crawler basics</h3>
<pre><code class="prism language-text">Is crawling legal or illegal?
    - The law does not prohibit it
    - It carries legal risk
    - Benign crawlers vs. malicious crawlers

The risks of crawling show up in two ways:
    - The crawler disrupts the normal operation of the site it visits
    - The crawler collects specific types of data or information that are protected by law

How do you avoid ending up in jail while writing and using crawlers?
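    - Keep optimizing your program so that it does not disrupt the normal operation of the sites it visits
    - When using or sharing crawled data, review what was collected; if it contains sensitive content such as user privacy or trade secrets, stop crawling or sharing it promptly

Crawler categories by use case
    - General-purpose crawler:
        A key component of a crawling system. It fetches whole pages.
    - Focused crawler:
        Built on top of the general-purpose crawler. It fetches specific parts of a page.
    - Incremental crawler:
        Detects updates on a site. It fetches only the newly updated data.

The spear and shield of crawling
    Anti-crawling mechanisms
        A portal site can adopt policies or technical measures to stop crawlers from scraping its data.
    Anti-anti-crawling strategies
        A crawler can adopt policies or technical measures to get past a site's anti-crawling mechanisms and obtain its data.

robots.txt protocol:
    A gentleman's agreement. It states which data on a site crawlers may fetch and which they may not.

HTTP protocol
    - Concept: a form of data exchange between a server and a client.
Common request headers
    - User-Agent: identifies the client making the request
    - Connection: whether to close the connection or keep it alive after the request completes
Common response headers
    - Content-Type: the type of the data the server returns to the client

HTTPS protocol:
    - Secure Hypertext Transfer Protocol
Encryption methods
    - Symmetric-key encryption
    - Asymmetric-key encryption
    - Certificate-based encryption
</code></pre>
<h3>Chapter 2: requests module basics</h3>
<pre><code>requests module
    - urllib module
    - requests module

requests module: a third-party Python module for making network requests (it is installed with pip, see below). It is powerful, simple to use, and efficient.
Purpose: simulate a browser sending requests.

How to use it (the requests coding workflow):
    - Specify the URL
        - UA spoofing
        - Handle the request parameters
    - Send the request
    - Get the response data
    - Persist the data

Installation:
    pip install requests

Hands-on coding:
    - Requirement: crawl the page data of the Sogou home page

Hands-on practice
    - Requirement: crawl the Sogou search results page for a given term (a simple page collector)
        - UA detection
        - UA spoofing
    - Requirement: crack Baidu Translate
        - POST request (carries parameters)
        - The response data is a set of JSON data
    - Requirement: crawl movie detail data from the Douban movie category rankings at https://movie.douban.com/
    - Homework: crawl the restaurant data for a given location from the KFC restaurant finder at http://www.kfc.com.cn/kfccda/index.aspx
    - Requirement: crawl the data on cosmetics production licences of the People's Republic of China from the National Medical Products Administration
        http://125.35.6.84:81/xk/
        - Dynamically loaded data
        - The company information on the home page is requested dynamically through ajax.

        http://125.35.6.84:81/xk/itownet/portal/dzpz.jsp?id=e6c1aa332b274282b04659a6ea30430a
        http://125.35.6.84:81/xk/itownet/portal/dzpz.jsp?id=f63f61fe04684c46a016a45eac8754fe
        - Looking at the detail page URLs shows that:
            - The domain is always the same; only the parameter (id) differs
            - The id values can be taken from the JSON string returned by the home page's ajax request
            - Joining the domain and an id value gives the complete URL of a company's detail page
        - The company details on the detail page are also loaded dynamically
            - http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
            - http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
            - Looking at these shows that:
                - All the POST requests use the same URL; only the id parameter differs.
                - Once the ids of many companies have been collected in bulk, each id can be combined with the URL to form the complete ajax request for that company's detail data

Data parsing:
    Focused crawlers
    Regular expressions
    bs4
    xpath
</code></pre>
<h3>Chapter 3: data parsing</h3>
<pre><code>Focused crawler: crawls specific content within a page.
    - Coding workflow:
        - Specify the URL
        - Send the request
        - Get the response data
        - Parse the data
        - Persist the data

Kinds of data parsing:
    - Regular expressions
    - bs4
    - xpath (***)

How data parsing works, in outline:
    - The pieces of text to extract are stored either between tags or in the attributes of tags
    - 1. Locate the target tag
    - 2. Extract (parse) the data stored in the tag or in its attributes

Regular expression parsing:

<div class="thumb">

<a href="/article/121721100" 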
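target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12172/121721100/medium/DNXDX9TZ8SDU6OK2.jpg" alt="指引我有前进的方向">
</a>

</div>
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

Parsing data with bs4
    - How data parsing works:
        - 1. Locate the tag
        - 2. Extract the data stored in the tag or in its attributes
    - How bs4 parsing works:
        - 1. Instantiate a BeautifulSoup object and load the page source into it
        - 2. Call the object's attributes or methods to locate tags and extract data
    - Installation:
        - pip install bs4
        - pip install lxml
    - Instantiating a BeautifulSoup object:
        - from bs4 import BeautifulSoup
        - Instantiation:
            - 1. Load the data from a local HTML document into the object
                    fp = open('./test.html','r',encoding='utf-8')
                    soup = BeautifulSoup(fp,'lxml')
            - 2. Load page source fetched from the internet into the object
                    page_text = response.text
                    soup = BeautifulSoup(page_text,'lxml')
        - Methods and attributes for parsing data:
            - soup.tagName: returns the first tagName tag that appears in the document
            - soup.find():
                - find('tagName'): equivalent to soup.tagName (for example, find('div') is the same as soup.div)
                - Locating by attribute:
                    - soup.find('div',class_/id/attr='song')
            - soup.find_all('tagName'): returns all matching tags (a list)
        - select:
            - select('some selector (id, class, tag... selector)'): returns a list.
            - Hierarchical selectors:
                - soup.select('.tang > ul > li > a'): > means one level
                - soup.select('.tang > ul a'): a space means any number of levels
        - Getting the text between tags:
            - soup.a.text/string/get_text()
            - text/get_text(): gets all the text content inside a tag
            - string: gets only the text directly under the tag
        - Getting an attribute value from a tag:
            - soup.a['href']

xpath parsing: the most commonly used, most convenient and efficient way to parse. General-purpose.
    - How xpath parsing works:
        - 1. Instantiate an etree object and load the page source to be parsed into it.
        - 2. Call the etree object's xpath method with an xpath expression to locate tags and capture content.
    - Installation:
        - pip install lxml
    - Instantiating an etree object: from lxml import etree
        - 1. Load the source of a local HTML document into an etree object:
            etree.parse(filePath)
        - 2. Load source fetched from the internet into the object
            etree.HTML(page_text)
        - xpath('xpath expression')
    - xpath expressions:
        - /: locate starting from the root node; also stands for one level.
        - //: stands for any number of levels; can start locating from anywhere.
        - Locating by attribute: //div[@class='song']   tag[@attrName="attrValue"]
        - Locating by index: //div[@class="song"]/p[3]   Indexes start at 1.
        - Getting text:
            - /text() gets the text directly inside the tag
            - //text() gets the text that is not directly inside the tag (all of the text content)
        - Getting an attribute:
            /@attrName     ==> img/@src

Homework:
    Crawl the free resume templates on 站长素材 (Chinaz)
</code></pre>
<h3>Chapter 4: CAPTCHAs</h3>
<pre><code>CAPTCHA recognition

The love-hate relationship between CAPTCHAs and crawlers?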
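Anti-crawling mechanism: CAPTCHAs. Recognize the data in the CAPTCHA image so that it can be used in a simulated login.

Ways to recognize a CAPTCHA:
    - Manual recognition by eye. (Not recommended)
    - Automatic recognition by a third party (recommended)
        - Yundama (云打码): http://www.yundama.com/demo.html
Yundama workflow:
    - Register: a regular user and a developer user
    - Log in:
        - Regular user login: check whether the user has any credits left
        - Developer user login:
            - Create an application: My software -> Add new software -> enter the software name -> submit (software id and key)
            - Download the sample code: Developer docs -> Click here to download: Yundama API DLL -> PythonHTTP sample download

Practice: recognize the CAPTCHA on the Gushiwen (古诗文网) login page.

Coding workflow for recognizing a CAPTCHA with a recognition platform:
    - Download the CAPTCHA image locally
    - Call the sample code provided by the platform to recognize the image data
</code></pre>
<h3>Chapter 5: advanced requests module</h3>
<pre><code>Simulated login:
    - Crawl information that belongs to specific users.
Requirement: simulate a login to Renren (人人网).
    - Clicking the login button sends a POST request
    - The POST request carries the login information entered beforehand (username, password, CAPTCHA, ...)
    - CAPTCHA: changes on every request

Requirement: crawl the current user's information (the user information shown on the personal home page)

HTTP/HTTPS characteristic: stateless.
Why the expected page data was not returned:
    When the second request, the one for the personal home page, is sent, the server does not know that it comes from a logged-in session.

cookie: lets the server record the client's state.
    - Manual handling: capture the cookie value with a packet-capture tool and put it in headers. (Not recommended)
    - Automatic handling:
        - Where does the cookie value come from?
            - The server creates it after the simulated-login POST request.
        Session object:
            - Purpose:
                1. It can send requests.
                2. If a cookie is produced during a request, the cookie is stored in the session object automatically and sent with later requests.
        - Create a session object: session = requests.Session()
        - Send the simulated-login POST request with the session object (the cookie is then stored in the session)
        - Send the GET request for the personal home page with the session object (the cookie is carried along)

Proxies: get past the anti-crawling mechanism of IP banning.
    What is a proxy:
        - A proxy server.
    What a proxy does:
        - Gets around access limits placed on your own IP.
        - Hides your real IP
    Proxy sites:
        - Kuaidaili (快代理)
        - Xici proxy (西祠代理)
        - www.goubanjia.com
    Proxy IP types:
        - http: applies to URLs that use the http protocol
        - https: applies to URLs that use the https protocol
    Proxy IP anonymity levels:
        - Transparent: the server knows the request used a proxy and also knows the real IP
        - Anonymous: the server knows a proxy was used but does not know the real IP
        - High anonymity: the server does not know a proxy was used, let alone the real IP
</code></pre>
<h3>Chapter 6: high-performance asynchronous crawlers</h3>
<pre><code>High-performance asynchronous crawlers
Goal: use asynchrony in a crawler to crawl data with high performance.

Ways to build an asynchronous crawler:
    - 1. Multithreading, multiprocessing (not recommended):
        Pro: a separate thread or process can be started for each blocking operation, so blocking operations run asynchronously.
        Con: threads or processes cannot be created without limit.
    - 2. Thread pools, process pools (use in moderation):
        Pro: lowers how often the system creates and destroys threads or processes, which reduces system overhead.
        Con: the number of threads or processes in the pool is capped.
    - 3. Single thread + asynchronous coroutines (recommended):
        event_loop: the event loop, which acts like an infinite loop. Functions can be registered on it, and when certain conditions are met the loop runs them.
        coroutine: a coroutine object. It can be registered on the event loop, which then calls it.
        A method defined with the async keyword is not executed immediately when called; it returns a coroutine object instead.
        task: a task, a further wrapper around a coroutine object that holds the task's states.
        future: represents a task that will run in the future or has not run yet; essentially no different from a task.
        async: defines a coroutine.
        await: suspends execution at a blocking call.
</code></pre>
<h3>Chapter 7: handling dynamically loaded data</h3>
<pre><code>Basic use of the selenium module

Question: how does the selenium module relate to crawling?
    - It makes it easy to get dynamically loaded data from a site
    - It makes it easy to simulate a login

What is the selenium module?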
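    - A module for browser automation.

selenium workflow:
    - Install: pip install selenium
    - Download a driver for a browser (Google Chrome)
        - Download location: http://chromedriver.storage.googleapis.com/index.html
        - Driver-to-browser version mapping: http://blog.csdn.net/huilan_same/article/details/51896672
    - Instantiate a browser object
    - Write the browser-automation code
        - Send a request: get(url)
        - Locate a tag: the find family of methods
        - Interact with a tag: send_keys('xxx')
        - Run JavaScript: execute_script('jsCode')
        - Back, forward: back(), forward()
        - Close the browser: quit()

    - Handling iframes with selenium
        - If the tag to locate is inside an iframe, switch_to.frame(id) must be used first
        - Action chains (dragging): from selenium.webdriver import ActionChains
            - Instantiate an action chain object: action = ActionChains(bro)
            - click_and_hold(div): press and hold the mouse button on the element
            - move_by_offset(x,y)
            - perform(): run the action chain immediately
            - action.release(): release the held mouse button

12306 simulated login
    - Chaojiying (超级鹰): http://www.chaojiying.com/about.html
        - Register: regular user
        - Log in: regular user
            - Check credits: top up
            - Create an application (id)
            - Download the sample code

    - Coding workflow for the 12306 simulated login:
        - Open the login page with selenium
        - Take a screenshot of the page that selenium currently has open
        - Crop the CAPTCHA region out of the screenshot
            - Benefit: the CAPTCHA image matches the login attempt one to one.
        - Recognize the CAPTCHA image with Chaojiying (it returns coordinates)
        - Use an action chain to click at those coordinates
        - Enter the username and password and click the login button to log in
</code></pre>
<h3>Chapter 8: the scrapy framework</h3>
<pre><code>scrapy framework

- What is a framework?
    - A project template that bundles many features and is highly general-purpose.
- How do you learn a framework?
    - Focus on learning how to use the features it wraps.
- What is scrapy?
    - A well-known ready-made framework for crawling. Features: high-performance persistent storage, asynchronous downloading, high-performance data parsing, distributed crawling

- Basic use of the scrapy framework
    - Installation:
        - mac or linux: pip install scrapy
        - windows:
            - pip install wheel
            - Download twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
            - Install twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
            - pip install pywin32
            - pip install scrapy
            Test: type the scrapy command in a terminal; if no error is reported, the installation succeeded!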
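            A rough Python equivalent of the terminal check above, as a sketch that uses only the standard library. The helper name is_installed is made up for illustration and is not part of scrapy:

```python
import importlib.util
import shutil


def is_installed(module_name, command_name=None):
    """Return True if the module is importable and, when given, the command is on PATH."""
    # find_spec returns None when no importable top-level module has this name.
    module_found = importlib.util.find_spec(module_name) is not None
    # shutil.which returns None when the executable cannot be found on PATH.
    command_found = command_name is None or shutil.which(command_name) is not None
    return module_found and command_found


if __name__ == "__main__":
    print("scrapy installed:", is_installed("scrapy", "scrapy"))
```

            After the pip steps above have succeeded, is_installed("scrapy", "scrapy") is expected to return True, provided the scrapy command is on PATH.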
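    - Create a project: scrapy startproject xxxPro
    - cd xxxPro
    - Create a spider file in the spiders subdirectory
        - scrapy genspider spiderName www.xxx.com
    - Run the project:
        - scrapy crawl spiderName

- Data parsing in scrapy

- Persistent storage in scrapy
    - Via terminal command:
        - Requirement: only the return value of the parse method can be stored to a local text file
        - Note: the text file used for storage can only be one of these types: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
        - Command: scrapy crawl xxx -o filePath
        - Pro: concise, efficient, convenient
        - Con: quite limited (data can only be stored in text files with the listed extensions)
https://www.bilibili.com/video/BV1ha4y1H7sx?p=64&spm_id_from=pageDriver
    - Via pipelines:
        - Coding workflow:
            - Parse the data
            - Define the corresponding fields in the item class
            - Wrap the parsed data in an item object
            - Submit the item object to the pipeline for persistent storage
            - In the pipeline class's process_item, persist the data stored in the item object it receives
            - Enable the pipeline in the settings file
        - Pro:
            - Highly general-purpose.

    - Interview question: how do you store one copy of the crawled data locally and another copy in a database?
        - Each pipeline class in the pipelines file stores data to one kind of platform
        - An item submitted by the spider file is received only by the first pipeline class to run
        - return item in process_item passes the item on to the next pipeline class to run

    - Whole-site crawling with Spider
        - That is, crawl the page data for every page number under one section of a site
        - Requirement: crawl the names of the photos on 校花网
        - Approaches:
            - Add the URLs of all pages to the start_urls list (not recommended)
            - Send the requests manually (recommended)
                - Sending a request manually:
                    - yield scrapy.Request(url,callback): callback is dedicated to data parsing

- The five core components
    Engine (Scrapy)
        Handles the data flow of the whole system and triggers events (the core of the framework)
    Scheduler
        Accepts the requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. It can be pictured as a priority queue of URLs (the addresses, or links, of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs
    Downloader
        Downloads page content and returns it to the spiders (the Scrapy downloader is built on twisted, an efficient asynchronous model)
    Spiders
        The spiders do the main work: they extract the information you need, the so-called items, from specific pages. You can also extract links from a page so that Scrapy goes on to crawl the next page
    Item Pipeline
        Processes the items that the spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a page is parsed by a spider, its items are sent to the item pipeline and processed through several steps in a specific order.

- Passing parameters between requests
    - When to use: when the data to crawl and parse is not all on the same page. (deep crawling)
    - Requirement: crawl the job titles and job descriptions on boss

- Crawling image data with ImagesPipeline
    - How does crawling string data with scrapy differ from crawling image data?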
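        - Strings: only need to be parsed with xpath and submitted to a pipeline for persistent storage
        - Images: xpath yields the value of the image's src attribute. A separate request to the image URL then fetches the image's binary data

    - ImagesPipeline:
        - You only need to parse the img src attribute value and submit it to the pipeline; the pipeline requests the image src, fetches the image's binary data, and also persists it for you.
    - Requirement: crawl the high-resolution images on 站长素材 (Chinaz)
    - Workflow:
        - Parse the data (the image URLs)
        - Submit the item holding the image URL to the designated pipeline class
        - In the pipelines file, write a custom pipeline class based on ImagesPipeline
            - get_media_requests
            - file_path
            - item_completed
        - In the settings file:
            - Set the image storage directory: IMAGES_STORE = './imgs_bobo'
            - Enable the pipeline: the custom pipeline class

- Middleware
    - Downloader middleware
        - Position: between the engine and the downloader
        - Purpose: intercept all requests and responses across the whole project
        - Intercepting requests:
            - UA spoofing: process_request
            - Proxy IP: process_exception: return request
        - Intercepting responses:
            - Tamper with the response data or the response object
            - Requirement: crawl the news data (titles and content) on NetEase News (网易新闻)
                - 1. Parse the URLs of the five section pages from the NetEase News home page (not dynamically loaded)
                - 2. The news titles in each section are loaded dynamically (dynamically loaded)
                - 3. Parse out the URL of each news detail page, fetch the detail page source, and parse out the news content

- CrawlSpider: a class, a subclass of Spider
    - Ways to crawl a whole site
        - With Spider: manual requests
        - With CrawlSpider
    - Using CrawlSpider:
        - Create a project
        - cd XXX
        - Create the spider file (CrawlSpider):
            - scrapy genspider -t crawl xxx www.xxxx.com
            - Link extractor:
                - Purpose: extract links according to a given rule (allow)
            - Rule parser:
                - Purpose: parse the links found by the link extractor according to a given rule (callback)
        # Requirement: crawl the item number, news title, and news content on the sun site
            - Analysis: the data to crawl is not all on the same page.
                - 1. Use a link extractor to extract all the page-number links
                - 2. Have a link extractor extract all the news detail page links

- Distributed crawlers
    - Concept: set up a distributed cluster of machines and have them jointly crawl one set of resources.
    - Purpose: make data crawling more efficient
    - How is it done?
        - Install the scrapy-redis component
        - Plain scrapy cannot run a distributed crawl on its own; it has to be combined with the scrapy-redis component.
        - Why can't plain scrapy be distributed?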
            - The scheduler cannot be shared by the cluster
            - The pipeline cannot be shared by the cluster
        - What the scrapy-redis component does:
            - It gives plain scrapy a pipeline and a scheduler that can be shared
    - Workflow
        - Create a project
        - Create a CrawlSpider-based spider file
        - Modify the spider file:
            - Import: from scrapy_redis.spiders import RedisCrawlSpider
            - Comment out start_urls and allowed_domains
            - Add a new attribute: redis_key = 'sun', the name of the shared scheduler queue
            - Write the data-parsing code
            - Change the spider class's parent class to RedisCrawlSpider
        - Modify the settings file
            - Use the shareable pipeline:
                ITEM_PIPELINES = {
                    'scrapy_redis.pipelines.RedisPipeline': 400
                }
            - Set the scheduler:
                # Adds a dedup filter class that stores request fingerprints in a Redis set, which makes request deduplication persistent
                DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
                # Use the scheduler that ships with scrapy-redis
                SCHEDULER = "scrapy_redis.scheduler.Scheduler"
                # Whether the scheduler persists: when the crawl ends, should the request queue and the fingerprint set in Redis be kept? True keeps the data; otherwise it is cleared
                SCHEDULER_PERSIST = True
            - Set the redis server:

        - redis configuration:
            - Edit the redis configuration file:
                - linux or mac: redis.conf
                - windows: redis.windows.conf
                - Open the configuration file and change it:
                    - Delete the line bind 127.0.0.1
                    - Turn off protected mode: change protected-mode yes to no
            - Start the redis server with the configuration file
                - redis-server configuration-file
            - Start the client:
                - redis-cli
        - Run the project:
            - scrapy runspider xxx.py
        - Push a starting URL onto the scheduler queue:
            - The scheduler queue lives in redis and is reached from the redis client
                - lpush xxx www.xxx.com
        - The crawled data is stored in the redis structure proName:items
</code></pre>
<h3>Chapter 9: incremental crawlers</h3>
<pre><code>Incremental crawlers
    - Concept: monitor a site for updates and crawl only its newly updated data.
    - Analysis:
        - Specify a starting URL
        - Use CrawlSpider to get the other page-number links
        - Use Rule to request those page-number links
        - Parse the URL of each movie detail page out of each page's source
        - Core: check whether a movie detail page URL has been requested before
            - Store the URLs of the movie detail pages that have been crawled
                - Store them in a redis set
        - Request the detail page URL, then parse out the movie's title and synopsis
        - Persist the data
</code></pre>
<h1>Analyzing dynamically loaded pages, POST request parameters, and crawling the content</h1>
<p>https://blog.csdn.net/Strive_0902/article/details/88972722</p>
<pre><code class="prism language-python"><span class="token keyword">import</span> requests
<span class="token keyword">from</span> lxml <span class="token keyword">import</span> etree
<span class="token keyword">import</span> time
<span class="token keyword">import</span> os
<span class="token keyword">import</span> sys
<span class="token keyword">import</span> json

ua <span class="token operator">=</span> <span class="token string">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"</span>
cookie1 <span class="token operator">=</span> <span class="token string">"trs_uv=jtz38ebv_373_14pv; BIGipServerjigou=1079027904.20480.0000; JSESSIONID=gyDbm3t9JVAlnN7VBkEH7Gk9CrEcAsd65-YfiCCqMLv-IkyP53TY!499435313"</span>
host1 <span class="token operator">=</span> <span class="token string">"jg.sac.net.cn"</span>
orgin1 <span class="token operator">=</span> <span class="token string">"http://jg.sac.net.cn"</span>
data1 <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"filter_EQS_O#otc_id"</span><span class="token punctuation">:</span><span class="token string">"01"</span><span class="token punctuation">,</span><span class="token string">"filter_EQS_O#sac_id"</span><span class="token punctuation">:</span><span class="token string">""</span><span class="token punctuation">,</span><span class="token string">"filter_LIKES_aoi_name"</span><span class="token punctuation">:</span><span class="token string">""</span><span class="token punctuation">,</span><span class="token string">"sqlkey"</span><span class="token punctuation">:</span> <span class="token string">"publicity"</span><span class="token punctuation">,</span><span class="token string">"sqlval"</span><span class="token punctuation">:</span> <span class="token string">"ORG_BY_TYPE_INFO"</span><span class="token punctuation">}</span>
headers1 <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">'User-agent'</span><span class="token punctuation">:</span> ua<span class="token punctuation">,</span><span class="token string">'Cookie'</span><span class="token punctuation">:</span>cookie1<span class="token punctuation">,</span><span class="token string">'Host'</span><span class="token punctuation">:</span>host1<span class="token punctuation">,</span><span class="token string">'Origin'</span><span class="token punctuation">:</span>orgin1<span class="token 
punctuation">}</span> Base_url <span class="token operator">=</span> <span class="token string">"http://jg.sac.net.cn/pages/publicity/resource!search.action"</span> page_url <span class="token operator">=</span> <span class="token string">"http://jg.sac.net.cn/pages/publicity/resource!search.action"</span> req <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>Base_url<span class="token punctuation">,</span>data <span class="token operator">=</span> data1<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers1<span class="token punctuation">)</span> <span class="token keyword">print</span><span class="token punctuation">(</span>req<span class="token punctuation">.</span>text<span class="token punctuation">)</span> res <span class="token operator">=</span> req<span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token comment">#print(res[0]['AOI_ID'])</span> <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>res<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span> page_data1 <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"filter_EQS_aoi_id"</span><span class="token punctuation">:</span> res<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'AOI_ID'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token string">"sqlkey"</span><span class="token punctuation">:</span> <span class="token string">"publicity"</span><span 
class="token punctuation">,</span> <span class="token string">"sqlval"</span><span class="token punctuation">:</span> <span class="token string">"SELECT_ZQ_REG_INFO"</span><span class="token punctuation">}</span> page_data2 <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token string">"filter_EQS_aoi_id"</span><span class="token punctuation">:</span> res<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'AOI_ID'</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token string">"sqlkey"</span><span class="token punctuation">:</span> <span class="token string">"publicity"</span><span class="token punctuation">,</span> <span class="token string">"sqlval"</span><span class="token punctuation">:</span> <span class="token string">"SEARCH_ZQGS_QUALIFATION"</span><span class="token punctuation">}</span> company_info <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token punctuation">}</span> page_req1 <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>page_url<span class="token punctuation">,</span> data<span class="token operator">=</span>page_data1<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers1<span class="token punctuation">)</span><span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span> page_req2 <span class="token operator">=</span> requests<span class="token punctuation">.</span>post<span class="token punctuation">(</span>page_url<span class="token punctuation">,</span> data<span class="token operator">=</span>page_data2<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers1<span class="token 
punctuation">)</span><span class="token punctuation">.</span>json<span class="token punctuation">(</span><span class="token punctuation">)</span> company_info<span class="token punctuation">[</span><span class="token string">"Chinese_Name"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_CHINESE_NAME'</span><span class="token punctuation">]</span> company_info<span class="token punctuation">[</span><span class="token string">"Info_Reg"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_INFO_REG'</span><span class="token punctuation">]</span> company_info<span class="token punctuation">[</span><span class="token string">"Legal_Represent"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_LEGAL_REPRESENTATIVE'</span><span class="token punctuation">]</span> company_info<span class="token punctuation">[</span><span class="token string">"License_Code"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_LICENSE_CODE'</span><span class="token punctuation">]</span> company_info<span class="token punctuation">[</span><span class="token string">"Reg_Capital"</span><span 
class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_REG_CAPITAL'</span><span class="token punctuation">]</span> company_info<span class="token punctuation">[</span><span class="token string">"Office_Address"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_OFFICE_ADDRESS'</span><span class="token punctuation">]</span> company_info<span class="token punctuation">[</span><span class="token string">"Office_Post_Code"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_OFFICE_ZIP_CODE'</span><span class="token punctuation">]</span> company_info<span class="token punctuation">[</span><span class="token string">"Com_Website"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'MRI_COM_WEBSITE'</span><span class="token punctuation">]</span> company_info<span class="token punctuation">[</span><span class="token string">"Customer_Service_Tel"</span><span class="token punctuation">]</span> <span class="token operator">=</span> page_req1<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token 
punctuation">[</span><span class="token string">'MRI_CUSTOMER_SERVICE_TEL'</span><span class="token punctuation">]</span> <span class="token comment"># print(page_req2)</span> <span class="token comment"># exit()</span> con <span class="token operator">=</span> <span class="token string">""</span> <span class="token keyword">for</span> j <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token builtin">len</span><span class="token punctuation">(</span>page_req2<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">:</span> con <span class="token operator">+=</span> page_req2<span class="token punctuation">[</span>j<span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">'PTSC_NAME'</span><span class="token punctuation">]</span><span class="token operator">+</span><span class="token string">","</span> company_info<span class="token punctuation">[</span><span class="token string">"Qualification_info"</span><span class="token punctuation">]</span> <span class="token operator">=</span> con <span class="token keyword">try</span><span class="token punctuation">:</span> <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span><span class="token string">"2.json"</span><span class="token punctuation">,</span> <span class="token string">'a+'</span><span class="token punctuation">,</span> encoding<span class="token operator">=</span><span class="token string">"utf-8"</span><span class="token punctuation">)</span> <span class="token keyword">as</span> fp<span class="token punctuation">:</span> fp<span class="token punctuation">.</span>write<span class="token punctuation">(</span>json<span class="token punctuation">.</span>dumps<span class="token punctuation">(</span>company_info<span class="token punctuation">,</span> 
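ensure_ascii<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span> <span class="token operator">+</span> <span class="token string">"\n"</span><span class="token punctuation">)</span>
    <span class="token keyword">except</span> IOError <span class="token keyword">as</span> err<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'error'</span> <span class="token operator">+</span> <span class="token builtin">str</span><span class="token punctuation">(</span>err<span class="token punctuation">)</span><span class="token punctuation">)</span>
</code></pre>
<h3>HTTP status codes in detail:</h3>
<p>https://blog.csdn.net/ithomer/article/details/10240351</p>
<p>When a user or a search engine sends a request to a web server, the server returns an HTTP status code in the response headers. The most common ones are:</p>
<p>1. HTTP/1.1 200 OK: normal access<br> The request succeeded. This is the status of a site that can be reached normally.</p>
<p>2. HTTP/1.1 301 Moved Permanently: permanent redirect<br> A redirect that is relatively friendly to search engines. When a site changes its domain, the old domain can be 301-redirected to the new one so that the old domain's ranking weight carries over. It is also common to 301-redirect the domain without www to the one with www, for example xxx.com to www.xxx.com.</p>
<p>3. HTTP/1.1 302 Found: temporary redirect<br> Search engines may treat it as cheating. Examples include redirects made with response.Redirect() in ASP programs, JavaScript redirects, and static HTTP redirects.</p>
<p>4. HTTP/1.1 400 Bad Request: the request cannot be processed<br> The server could not understand the request, in general because it is malformed. On hosted sites it typically shows up when the domain has not been bound to the server successfully or has not been registered (ICP filing).</p>
<p>5. HTTP/1.1 403 Forbidden: no permission to access the site<br> Typical causes: your IP has been blacklisted; too many users are connected (try again later); or the domain resolves to the hosting space but the space has not been bound to the domain.</p>
<p>6. HTTP/1.1 404 Not Found: the file or directory does not exist<br> The requested file or directory does not exist or has been deleted. When you set up a custom 404 error page, make sure it still returns status 404. A badly configured error page often makes nonexistent pages return something other than 404, which can lead search engines to lower the site's ranking.</p>
<p>7. HTTP/1.1 500 Internal Server Error: program or server error<br> An internal error occurred in the server-side program. It usually means there is a mistake in the page's code, such as a small syntax error or a failed data connection.</p>
<h3>curl</h3>
<table> <thead> <tr> <th>Option</th> <th>Description</th> <th>Example</th> </tr> </thead> <tbody> <tr> <td>-A</td> <td>Set the user-agent</td> <td>curl -A "Chrome" 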
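http://www.baidu.com</td> </tr>
<tr> <td>-X</td> <td>Send the request with the given method</td> <td>curl -X POST http://httpbin.org/post</td> </tr>
<tr> <td>-I</td> <td>Return only the response headers (sends a HEAD request)</td> <td></td> </tr>
<tr> <td>-d</td> <td>Request the URL with POST and send the given parameters</td> <td>-d a=1 -d b=2 -d c=3 | -d "a=1&b=2&c=3" | -d @filename</td> </tr>
<tr> <td>-O</td> <td>Download the file and save it under its remote file name</td> <td></td> </tr>
<tr> <td>-o</td> <td>Download the file and save it under the given file name</td> <td></td> </tr>
<tr> <td>-H</td> <td>Set a request header</td> <td></td> </tr>
<tr> <td>-k</td> <td>Allow insecure SSL requests (skip certificate verification)</td> <td></td> </tr>
</tbody> </table>
<p>https://www.ruanyifeng.com/blog/2019/09/curl-reference.html</p>
<h1>AJAX: Atguigu (尚硅谷) tutorial notes</h1>
<p>https://www.wrysmile.cn/Learn-AJAX.html</p>
<h3>I. Basics</h3>
<h4>1.AJAX</h4>
<ul>
<li>AJAX stands for Asynchronous JavaScript and XML. With AJAX, the browser can send asynchronous requests to the server</li>
<li>Advantages: <ul> <li>It can communicate with the server without refreshing the page</li> <li>It allows parts of the page to be updated in response to user events</li> </ul> </li>
<li>Disadvantages: <ul> <li>No browsing history, so the user cannot go back</li> <li>Cross-origin restrictions apply (same-origin policy)</li> <li>Not very SEO-friendly</li> </ul> </li>
</ul>
<h4>2.XML</h4>
<ul>
<li> <p>XML was designed to transport and store data</p> </li>
<li> <p>Differences between XML and HTML:</p> <ul> <li>XML has no predefined tags; every tag is custom and is used to represent data</li> <li>HTML uses only predefined tags</li> </ul> </li>
<li> <p>It has now been replaced by JSON</p> </li>
</ul>
<p>Express server-side framework: basic usage</p>
<h4>3.HTTP</h4>
<ul> <li>The Hypertext Transfer Protocol, which specifies in detail the rules by which browsers and web servers communicate with each other</li> </ul>
<h3>(1). Request message</h3>
<ul>
<li> <p>Request line: GET or POST / url / HTTP protocol version</p> </li>
<li> <p>Request headers: written as key-value pairs</p> <ul> <li>Host:xxxx</li> <li>Cookie:name=wrysmile</li> </ul> </li>
<li> <p>Blank line: always present</p> </li>
<li> <p>Request body:</p> <ul> <li>For a GET request, the request body is empty</li> <li>For a POST request, the request body may be non-empty</li> </ul> </li>
</ul>
<h3>(2). Response message</h3>
<ul>
<li>Response line: HTTP protocol version / response status code / response status text <ul> <li>1xx: informational; the server received the request and the requester should continue</li> <li>2xx: success; the operation was received and processed successfully</li> <li>3xx: redirection; further action is needed to complete the request</li> <li>4xx: client error; the request contains a syntax error or cannot be completed</li> <li>5xx: server error; the server hit an error while processing the request</li> <li>For the individual status codes, see the status code details above</li> </ul> </li>
<li>Response headers: <ul> <li>Content-Type:text/html;charset=utf-8</li> </ul> </li>
<li>Blank line: always present and required</li>
<li>Response body: all of the HTML content</li>
</ul>
</div> </div> </div> </div> </div> <!--PC和WAP自适应版--> <div id="SOHUCS" 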
sid="1529245259433799680"></div> </div> </div> </div> </div> <div> <div class="container"> <div class="indexes"> <strong>Browse by letter:</strong> <a href="/tags/A/1.htm" target="_blank">A</a><a href="/tags/B/1.htm" target="_blank">B</a><a href="/tags/C/1.htm" target="_blank">C</a><a href="/tags/D/1.htm" target="_blank">D</a><a href="/tags/E/1.htm" target="_blank">E</a><a href="/tags/F/1.htm" target="_blank">F</a><a href="/tags/G/1.htm" target="_blank">G</a><a href="/tags/H/1.htm" target="_blank">H</a><a href="/tags/I/1.htm" target="_blank">I</a><a href="/tags/J/1.htm" target="_blank">J</a><a href="/tags/K/1.htm" target="_blank">K</a><a href="/tags/L/1.htm" target="_blank">L</a><a href="/tags/M/1.htm" target="_blank">M</a><a href="/tags/N/1.htm" target="_blank">N</a><a href="/tags/O/1.htm" target="_blank">O</a><a href="/tags/P/1.htm" target="_blank">P</a><a href="/tags/Q/1.htm" target="_blank">Q</a><a href="/tags/R/1.htm" target="_blank">R</a><a 
href="/tags/S/1.htm" target="_blank">S</a><a href="/tags/T/1.htm" target="_blank">T</a><a href="/tags/U/1.htm" target="_blank">U</a><a href="/tags/V/1.htm" target="_blank">V</a><a href="/tags/W/1.htm" target="_blank">W</a><a href="/tags/X/1.htm" target="_blank">X</a><a href="/tags/Y/1.htm" target="_blank">Y</a><a href="/tags/Z/1.htm" target="_blank">Z</a><a href="/tags/0/1.htm" target="_blank">其他</a> </div> </div> </div> <footer id="footer" class="mb30 mt30"> <div class="container"> <div class="footBglm"> <a target="_blank" href="/">首页</a> - <a target="_blank" href="/custom/about.htm">关于我们</a> - <a target="_blank" href="/search/Java/1.htm">站内搜索</a> - <a target="_blank" href="/sitemap.txt">Sitemap</a> - <a target="_blank" href="/custom/delete.htm">侵权投诉</a> </div> <div class="copyright">版权所有 IT知识库 CopyRight © 2000-2050 E-COM-NET.COM , All Rights Reserved. <!-- <a href="https://beian.miit.gov.cn/" rel="nofollow" target="_blank">京ICP备09083238号</a><br>--> </div> </div> </footer> <!-- 代码高亮 --> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shCore.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shLegacy.js"></script> <script type="text/javascript" src="/static/syntaxhighlighter/scripts/shAutoloader.js"></script> <link type="text/css" rel="stylesheet" href="/static/syntaxhighlighter/styles/shCoreDefault.css"/> <script type="text/javascript" src="/static/syntaxhighlighter/src/my_start_1.js"></script> </body> </html>