Python Modules: Learning to Use feedparser

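    While reading a book today I stumbled on a handy tool: the feedparser module. feedparser bills itself as a universal feed parser, and with it we can easily pull the titles, links, and article entries out of any RSS or Atom feed. That billing is not empty talk; the module really is powerful. After downloading and unpacking the source, it can be installed directly with:
    python setup.py install
    Alternatively, install the module with: pip install feedparser
    As for what RSS is, I did not really know either and only understood after looking it up: RSS is short for RDF Site Summary (RDF in turn is short for Resource Description Framework), and it means describing a website's summary in XML. RSS 2.0 expands the same acronym as Really Simple Syndication.
    If, like me, you do not know what RSS is, you can read the write-up here; I found it a thorough summary.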
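    To make "a site summary described in XML" concrete, here is a minimal sketch that reads a tiny RSS 2.0 document with the standard library only. The document itself, the example.com URLs, and the helper name list_items are made up for illustration; real feeds carry many more elements.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document; the site, titles, and URLs are placeholders.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>A site summary expressed as XML</description>
    <item>
      <title>First post</title>
      <link>http://example.com/posts/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/posts/2</link>
    </item>
  </channel>
</rss>"""


def list_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)      # the <rss> element
    channel = root.find("channel")      # RSS 2.0 has a single <channel>
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]


if __name__ == "__main__":
    for title, link in list_items(SAMPLE_RSS):
        print(title, link)
```

    feedparser does this kind of extraction for us, and additionally copes with the different RSS and Atom versions.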

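    Enough chatting. If you want more background, a keyword search online turns up plenty of material. Below is my own experiment: using the feedparser module to parse pages and return the information we need.

    The concrete implementation: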

#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
__Author__: 沂水寒城
Purpose: learn to use the feedparser module
'''

import feedparser


def test(url='http://blog.csdn.net/together_cz/article'):
    '''
    Learn to use feedparser.
    Input: url
    Output: page information (printed)
    '''
    # parse() returns a FeedParserDict, which behaves like a dictionary.
    one_page_dict = feedparser.parse(url)
    print(one_page_dict)
    # Top-level keys seen in this run (10 in total):
    # ['feed', 'status', 'version', 'encoding', 'bozo', 'headers', 'href', 'namespaces', 'entries', 'bozo_exception']
    print(list(one_page_dict.keys()))
    print('----------------------------------------------------------')
    print('Requested page href:')
    print(one_page_dict['href'])
    print('Response headers:')
    print(one_page_dict['headers'])
    print('Page version:')
    print(one_page_dict['version'])
    print('Page status code:')
    print(one_page_dict['status'])
    # The lookups below raise KeyError when a page lacks these fields;
    # the caller catches the exception.
    print('Page language:')
    print(one_page_dict['feed']['html']['lang'])
    print('Page meta information:')
    print(one_page_dict['feed']['meta']['content'])
    print(one_page_dict['feed']['meta']['name'])


if __name__ == '__main__':
    url_list = ['http://www.baidu.com', 'http://www.vmall.com', 'http://www.taobao.com']
    for one_url in url_list:
        # Single-argument print() behaves the same on Python 2 and Python 3.
        print('Current url ---> %s' % one_url)
        try:
            test(one_url)
        except Exception:
            # Any failure (for example a missing field) ends this page's report early.
            print('***************************************************************')
        print('----------------------------------------------------------')


The output is as follows:

Current url ---> http://www.baidu.com
{'feed': {'style': {'type': u'text/css', 'id': u'css_index_result', 'data-for': u'result'}, 'meta': {'content': u'0; url=/baidu.html?from=noscript', 'http-equiv': u'refresh'}, 'summary': u'...', 'links': [...], 'script': {'data-compress': u'strip'}}, 'status': 200, 'version': u'', 'encoding': u'utf-8', 'bozo': 1, 'headers': {...}, 'href': u'http://www.baidu.com', 'namespaces': {}, 'entries': [], 'bozo_exception': SAXParseException('not well-formed (invalid token)',)}
['feed', 'status', 'version', 'encoding', 'bozo', 'headers', 'href', 'namespaces', 'entries', 'bozo_exception']
----------------------------------------------------------
Requested page href:
http://www.baidu.com
Response headers:
{'bdqid': '0x9105768b00086a8c', 'x-powered-by': 'HPHP', 'transfer-encoding': 'chunked', 'set-cookie': 'BAIDUID=55E1745F997C6FE44DAC108E4969560D:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, BIDUPSID=55E1745F997C6FE44DAC108E4969560D; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, PSTM=1500623780; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com, BDSVRTM=0; path=/, BD_HOME=0; path=/, H_PS_PSSID=1467_21091_20718; path=/; domain=.baidu.com', 'expires': 'Fri, 21 Jul 2017 07:55:49 GMT', 'vary': 'Accept-Encoding', 'bduserid': '0', 'content-encoding': 'gzip', 'connection': 'Close', 'cxy_all': 'baidu+486eb61e24b5f3dcaa17ac1ebe540f35', 'cache-control': 'private', 'date': 'Fri, 21 Jul 2017 07:56:20 GMT', 'p3p': 'CP=" OTI DSP COR IVA OUR IND COM "', 'server': 'BWS/1.1', 'content-type': 'text/html; charset=utf-8', 'bdpagetype': '1', 'x-ua-compatible': 'IE=Edge,chrome=1'}
Page version:

Page status code:
200
Page language:
***************************************************************
----------------------------------------------------------
Current url ---> http://www.vmall.com
{'feed': {'meta': {'content': u'upgrade-insecure-requests', 'http-equiv': u'Content-Security-Policy'}, 'language': u'zh-CN'}, 'status': 301, 'version': u'', 'encoding': u'UTF-8', 'bozo': 1, 'headers': {'content-length': '42466', 'content-language': 'zh-CN', 'content-encoding': 'gzip', 'accept-ranges': 'bytes', 'vary': 'Accept-Encoding', 'server': 'openresty', 'connection': 'close', 'date': 'Fri, 21 Jul 2017 07:56:21 GMT', 'content-type': 'text/html;charset=UTF-8'}, 'href': u'https://www.vmall.com/', 'namespaces': {}, 'entries': [], 'bozo_exception': SAXParseException('not well-formed (invalid token)',)}
['feed', 'status', 'version', 'encoding', 'bozo', 'headers', 'href', 'namespaces', 'entries', 'bozo_exception']
----------------------------------------------------------
Requested page href:
https://www.vmall.com/
Response headers:
{'content-length': '42466', 'content-language': 'zh-CN', 'content-encoding': 'gzip', 'accept-ranges': 'bytes', 'vary': 'Accept-Encoding', 'server': 'openresty', 'connection': 'close', 'date': 'Fri, 21 Jul 2017 07:56:21 GMT', 'content-type': 'text/html;charset=UTF-8'}
Page version:

Page status code:
301
Page language:
***************************************************************
----------------------------------------------------------
Current url ---> http://www.taobao.com
{'feed': {'style': {'type': u'text/css'}, 'links': [...], 'meta': {'content': u'\u6dd8\u5b9d,\u638f\u5b9d,\u7f51\u4e0a\u8d2d\u7269,C2C,\u5728\u7ebf\u4ea4\u6613,\u4ea4\u6613\u5e02\u573a,\u7f51\u4e0a\u4ea4\u6613,\u4ea4\u6613\u5e02\u573a,\u7f51\u4e0a\u4e70,\u7f51\u4e0a\u5356,\u8d2d\u7269\u7f51\u7ad9,\u56e2\u8d2d,\u7f51\u4e0a\u8d38\u6613,\u5b89\u5168\u8d2d\u7269,\u7535\u5b50\u5546\u52a1,\u653e\u5fc3\u4e70,\u4f9b\u5e94,\u4e70\u5356\u4fe1\u606f,\u7f51\u5e97,\u4e00\u53e3\u4ef7,\u62cd\u5356,\u7f51\u4e0a\u5f00\u5e97,\u7f51\u7edc\u8d2d\u7269,\u6253\u6298,\u514d\u8d39\u5f00\u5e97,\u7f51\u8d2d,\u9891\u9053,\u5e97\u94fa', 'name': u'keyword'}, 'summary': u'...', 'html': {'lang': u'zh-CN'}, 'link': u'https://tce.taobao.com', 'base': {'target': u'_blank'}}, 'status': 302, 'version': u'', 'encoding': u'utf-8', 'bozo': 1, 'headers': {...}, 'href': u'https://www.taobao.com/', 'namespaces': {}, 'entries': [], 'bozo_exception': SAXParseException('mismatched tag',)}
['feed', 'status', 'version', 'encoding', 'bozo', 'headers', 'href', 'namespaces', 'entries', 'bozo_exception']
----------------------------------------------------------
Requested page href:
https://www.taobao.com/
Response headers:
{'via': 'cache2.l2nu16[118,200-0,C], cache16.l2nu16[71,0], cache9.cn345[0,200-0,H], cache4.cn345[1,0]', 'x-swift-cachetime': '90', 'x-cache': 'HIT TCP_MEM_HIT dirn:-2:-2', 'content-encoding': 'gzip', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding, Ali-Detector-Type, X-CIP-PT', 'age': '10', 'strict-transport-security': 'max-age=31536000', 'eagleid': '1bdd1e6515006237813128349e', 'server': 'Tengine', 'cache-control': 'max-age=0, s-maxage=90', 'connection': 'close', 'x-swift-savetime': 'Fri, 21 Jul 2017 07:56:11 GMT', 'set-cookie': 'thw=cn; Path=/; Domain=.taobao.com; Expires=Sat, 21-Jul-18 07:56:21 GMT;', 'date': 'Fri, 21 Jul 2017 07:56:21 GMT', 'content-type': 'text/html; charset=utf-8', 'timing-allow-origin': '*'}
Page version:

Page status code:
302
Page language:
zh-CN
Page meta information:
***************************************************************
----------------------------------------------------------
[Finished in 1.3s]
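    In the output above, the 'feed' dictionaries for baidu.com and vmall.com contain no 'html' key, so the lookup one_page_dict['feed']['html']['lang'] raises KeyError and the except branch prints the row of asterisks. A minimal sketch (Python 3) of a lookup that returns a default instead; the helper name dig is hypothetical, and the two sample dictionaries are trimmed copies of values from the output.

```python
def dig(mapping, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Trimmed copies of two 'feed' dictionaries from the output above.
baidu_feed = {'meta': {'content': '0; url=/baidu.html?from=noscript',
                       'http-equiv': 'refresh'}}
taobao_feed = {'html': {'lang': 'zh-CN'}, 'meta': {'name': 'keyword'}}

print(dig(baidu_feed, 'html', 'lang', default='(not present)'))   # (not present)
print(dig(taobao_feed, 'html', 'lang', default='(not present)'))  # zh-CN
print(dig(baidu_feed, 'meta', 'name', default='(not present)'))   # (not present)
```

    Because the result of feedparser.parse behaves like a dictionary, the same helper should work on it directly, which lets the report for one page continue past a missing field.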

