Python Crawler Primer (7): First Experience with the urllib Library and a Discussion of Chinese Encoding Issues

This Python series is based on Python 3.4.

---------@_@? --------------------------------------------------------------------

  • The question: how can we fetch a web page's source code in the simplest way?
  • The approach: use the urllib library to grab the page's source code.

------------------------------------------------------------------------------------

  • Code example
#python3.4
import urllib.request

# open the URL and print the raw response body (a bytes object)
response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read())
  • Run result
b'\n<!DOCTYPE html>\n<html>\n<head>\n    <meta charset="utf-8"/>\n    <title>\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title>    \n    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>\n    <meta content="\xe6\x8a\x80\xe6\x9c\xaf\xe6\x90\x9c\xe7\xb4\xa2,IT\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe6\x90\x9c\xe7\xb4\xa2,\xe4\xbb\xa3\xe7\xa0\x81\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e" name="keywords" />\n    <meta content="\xe9\x9d\xa2\xe5\x90\x91\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe7\x9a\x84\xe4\xb8\x93\xe4\xb8\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x82\xe9\x81\x87\xe5\x88\xb0\xe6\x8a\x80\xe6\x9c\xaf\xe9\x97\xae\xe9\xa2\x98\xe6\x80\x8e\xe4\xb9\x88\xe5\x8a\x9e\xef\xbc\x8c\xe5\x88\xb0\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b..." name="description" />\n    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />\n    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>\n    <script src="/Scripts/Common.js" type="text/javascript"></script>\n    <script src="/Scripts/Home.js" type="text/javascript"></script>\n</head>\n<body>\n    <div class="top">\n        \n        <div class="top_tabs">\n            <a href="http://www.cnblogs.com">\xc2\xab \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe9\xa6\x96\xe9\xa1\xb5 </a>\n        </div>\n        <div id="span_userinfo" class="top_links">\n        </div>\n    </div>\n    <div style="clear: both">\n    </div>\n    <center>\n        <div id="main">\n            <div class="logo_index">\n                <a href="http://zzk.cnblogs.com">\n                    <img alt="\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8blogo" src="/images/logo.gif" /></a>\n            </div>\n            <div class="index_sozone">\n                <div class="index_tab">\n                    <a href="/n" onclick="return  channelSwitch('n');">\xe6\x96\xb0\xe9\x97\xbb</a>\n<a class="tab_selected" href="/b" onclick="return  channelSwitch('b');">\xe5\x8d\x9a\xe5\xae\xa2</a>                    <a href="/k" onclick="return  channelSwitch('k');">\xe7\x9f\xa5\xe8\xaf\x86\xe5\xba\x93</a>\n                    <a href="/q" onclick="return  channelSwitch('q');">\xe5\x8d\x9a\xe9\x97\xae</a>\n                </div>\n                <div class="search_block">\n                    <div class="index_btn">\n                        <input type="button" class="btn_so_index" onclick="Search();" value=" \xe6\x89\xbe\xe4\xb8\x80\xe4\xb8\x8b " />\n                        <span class="help_link"><a target="_blank" href="/help">\xe5\xb8\xae\xe5\x8a\xa9</a></span>\n                    </div>\n                    <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />\n                </div>\n            </div>\n        </div>\n        <div class="footer">\n            \xc2\xa92004-2016 <a href="http://www.cnblogs.com">\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</a>\n        </div>\n    </center>\n</body>\n</html>\n'
  • For comparison, the Python 2.7 implementation:
#python2.7
import urllib2
 
response = urllib2.urlopen("http://zzk.cnblogs.com/b")
print response.read()
  • As you can see, the Python 3.4 and Python 2.7 code differ: in Python 3, the old urllib2 module was split into urllib.request and urllib.error (a version-agnostic sketch follows below).
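
If you need a single script that runs under both versions, a minimal sketch is to try the Python 3 import first and fall back to urllib2:

#python3.4 / python2.7
try:
    # Python 3: urllib2 was split into urllib.request and urllib.error
    from urllib.request import urlopen
except ImportError:
    # Python 2 fallback
    from urllib2 import urlopen

response = urlopen("http://zzk.cnblogs.com/b")
print(response.read())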

 

----------@_@? A problem appears! ----------------------------------------------------------------------

  • The problem: look at the run result above and you will see that the Chinese text is not displayed properly.
  • The fix: handle the Chinese character encoding.

--------------------------------------------------------------------------------------------------

 

  • Handling the Chinese in the page source!!!
  • Modify the code as follows:
#python3.4
import urllib.request

response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
# decode the raw bytes as UTF-8 so the Chinese displays correctly
print(response.read().decode('UTF-8'))
  • Run it; the output shows:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>找找看 - 博客园</title>    
    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
    <meta content="技术搜索,IT搜索,程序搜索,代码搜索,程序员搜索引擎" name="keywords" />
    <meta content="面向程序员的专业搜索引擎。遇到技术问题怎么办,到博客园找找看..." name="description" />
    <link type="text/css" href="/Content/Style.css" rel="stylesheet" />
    <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script>
    <script src="/Scripts/Common.js" type="text/javascript"></script>
    <script src="/Scripts/Home.js" type="text/javascript"></script>
</head>
<body>
    <div class="top">
        
        <div class="top_tabs">
            <a href="http://www.cnblogs.com">« 博客园首页 </a>
        </div>
        <div id="span_userinfo" class="top_links">
        </div>
    </div>
    <div style="clear: both">
    </div>
    <center>
        <div id="main">
            <div class="logo_index">
                <a href="http://zzk.cnblogs.com">
                    <img alt="找找看logo" src="/images/logo.gif" /></a>
            </div>
            <div class="index_sozone">
                <div class="index_tab">
                    <a href="/n" onclick="return  channelSwitch('n');">新闻</a>
<a class="tab_selected" href="/b" onclick="return  channelSwitch('b');">博客</a>                    <a href="/k" onclick="return  channelSwitch('k');">知识库</a>
                    <a href="/q" onclick="return  channelSwitch('q');">博问</a>
                </div>
                <div class="search_block">
                    <div class="index_btn">
                        <input type="button" class="btn_so_index" onclick="Search();" value=" 找一下 " />
                        <span class="help_link"><a target="_blank" href="/help">帮助</a></span>
                    </div>
                    <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" />
                </div>
            </div>
        </div>
        <div class="footer">
            ©2004-2016 <a href="http://www.cnblogs.com">博客园</a>
        </div>
    </center>
</body>
</html>


Process finished with exit code 0
  • Result: after handling the encoding, the Chinese in the page source displays correctly (a slightly more robust variant is sketched below).
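
Hard-coding 'UTF-8' happens to match this page's <meta charset="utf-8"/> declaration. A more defensive sketch reads the charset from the response's Content-Type header and only falls back to UTF-8 when the header does not supply one:

#python3.4
import urllib.request

response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
# response.headers is an http.client.HTTPMessage;
# get_content_charset() returns the charset= value from Content-Type, or None
charset = response.headers.get_content_charset() or 'utf-8'
print(response.read().decode(charset))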

 

 

-----------@_@! Exploring a new Chinese-encoding question ----------------------------------------------------------

   Question: "If Chinese characters appear in the URL, how should we handle them?"

   For example: url = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"


-----------------------------------------------------------------------------------------------------

 

  • Next, let's solve the problem of Chinese characters in the URL!!!

(1) Test 1: keep the URL in its original form and request it directly, with no processing

  • Code example:
#python3.4
import urllib.request

# the URL still contains raw, unencoded Chinese characters
url = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('UTF-8'))
  • Run result:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 9, in 
    response = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 463, in open
    response = self._open(req, data)
  File "C:\Python34\lib\urllib\request.py", line 481, in _open
    '_open', req)
  File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python34\lib\urllib\request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python34\lib\http\client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "C:\Python34\lib\http\client.py", line 1116, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python34\lib\http\client.py", line 973, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)

Process finished with exit code 1

  Sure enough, it fails!!! The last frame of the traceback shows why: http.client encodes the request line as ASCII (request.encode('ascii')), and the Chinese characters in the URL fall outside the ASCII range.

 

(2) Test 2: percent-encode the Chinese part separately

  • Code example:
#python3.4
import urllib.request
import urllib.parse

# percent-encode only the Chinese fragment, then splice it into the URL
url = "http://zzk.cnblogs.com/s?w=python" + urllib.parse.quote("爬虫") + "&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))
  • Run result:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <title>python爬虫-博客园找找看</title>
    <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/>
    <link href="/Content/so.css?id=20140908" rel="stylesheet" type="text/css" />
    <link href="/Content/jquery-ui-1.8.21.custom.css" rel="stylesheet" type="text/css" />
</head>
<body>
    ...
    博客园找找看,找到相关内容1491篇,用时132毫秒
    <div class="searchItem">
        <a target="_blank" href="http://www.cnblogs.com/hearzeus/p/5238867.html">Python 爬虫入门——小项目实战(自动私信博客园某篇博客下的评论人,随机发送一条笑话,完整代码在博文最后)</a>
        ...
    </div>
    ... (the remaining search-result entries, the pager, and the sidebar markup are omitted here) ...
</body>
</html>

Process finished with exit code 0
  • Result: with the Chinese part of the URL encoded separately, the page at that URL is fetched successfully (an idiomatic variant using urlencode is sketched below).
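
Splicing quote() fragments by hand works, but for query strings a tidier sketch is urllib.parse.urlencode, which percent-encodes every parameter value, Chinese included, in one call:

#python3.4
import urllib.request
import urllib.parse

# urlencode() turns the dict into "w=python%E7%88%AC%E8%99%AB&t=b"
params = urllib.parse.urlencode({'w': 'python爬虫', 't': 'b'})
resp = urllib.request.urlopen("http://zzk.cnblogs.com/s?" + params)
print(resp.read().decode('utf-8'))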

 

------@_@! Yet another new question -----------------------------------------------------------

  • Question: what if we process the whole URL, Chinese and English together? Can the page still be fetched?

----------------------------------------------------------------------------------------

(3) And so, Test 3: process the Chinese and English in the URL together

  • Code example:
#python3.4
import urllib.request
import urllib.parse

# run the entire URL, scheme and all, through quote()
url = urllib.parse.quote("http://zzk.cnblogs.com/s?w=python爬虫&t=b")
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))
  • Run result:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 21, in 
    resp = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 448, in open
    req = Request(fullurl, data)
  File "C:\Python34\lib\urllib\request.py", line 266, in __init__
    self.full_url = url
  File "C:\Python34\lib\urllib\request.py", line 292, in full_url
    self._parse()
  File "C:\Python34\lib\urllib\request.py", line 321, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'http%3A//zzk.cnblogs.com/s%3Fw%3Dpython%E7%88%AC%E8%99%AB%26t%3Db'

Process finished with exit code 1
  • Result: ValueError! The page could not be fetched!

 

  • Combining tests 1, 2 and 3, we can draw the following conclusions:

(1) In Python 3.4, Chinese characters in a URL can be handled with urllib.parse.quote("爬虫").

(2) The Chinese parts of a URL must be encoded individually; you cannot run the Chinese and English through quote() together (see the sketch below for why).
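
The ValueError in Test 3 comes from quote()'s default behavior: it escapes everything except letters, digits, '_.-~' and '/', so the ':' in "http://" becomes "%3A" and urllib no longer recognizes the URL scheme. If you really do want to pass a whole URL through quote(), one sketch is to whitelist the reserved delimiters via the safe parameter:

#python3.4
import urllib.parse

url = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"
# keep the delimiters : / ? = & unescaped; only the Chinese gets percent-encoded
print(urllib.parse.quote(url, safe=':/?=&'))
# -> http://zzk.cnblogs.com/s?w=python%E7%88%AC%E8%99%AB&t=b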

 

  • Tip: if you want to see what parameters a function accepts and how to pass them:
#python3.4
import urllib.request
help(urllib.request.urlopen)
  • Running the code above prints the following to the console:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x00A50490>, *, cafile=None, capath=None, cadefault=False, context=None)

Process finished with exit code 0

 

  @_@)Y, that's all for this post~ to be continued~

Reposted from: https://www.cnblogs.com/lmei/p/5333644.html
