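Crawler tool overview: a simple tool for downloading files of a specified type from a web page.
The tool uses the BeautifulSoup library to parse the page's HTML, and lets you conveniently specify the page URL, the file type, the number of files to download, and the folder to save them in.
The implementation is very simple, and it can quickly download the XML files that describe openEuler dependency relationships.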
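
from urllib.request import urlopen

from bs4 import BeautifulSoup


def getFile(url, file_path):
    """Download the file at url and save it locally; file_path is the full path of the saved file."""
    # Create the file in the target folder, named after the crawled file
    with urlopen(url, timeout=10) as u, open(file_path, 'wb') as f:
        f.write(u.read())
    print(file_path + ' downloaded successfully')
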

def get(link_url, file_type, num, fold):
    link = link_url  # page to crawl
    soup = BeautifulSoup(urlopen(link), 'html.parser')  # parse the page
    i = 0
    for a in soup.find_all('a'):
        if i == num:
            break
        s = a.get("href")
        # Skip anchors without an href; download links whose href contains file_type
        if s and file_type in s:
            i = i + 1
            getFile(link + s, fold + s)


if __name__ == '__main__':
    # URL of the page to crawl
    link = 'https://repo.openeuler.org/openEuler-20.03-LTS/source/repodata/'
    # File type to download
    kind = "xml"
    # Number of files to download, counting from the first match
    num = 100
    # Folder to save the files in
    fold = "E:\\store\\"
    get(link, kind, num, fold)

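The link-filtering step of the crawler can be tried out separately from the download step. The sketch below is not the tool's own code: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages or network access, and it uses `urljoin` instead of string concatenation to build the download URL. The names `LinkCollector` and `select_links`, and the sample listing page, are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href
                self.hrefs.append(href)


def select_links(html, base_url, file_type, limit):
    """Return up to `limit` absolute URLs whose href contains `file_type`."""
    collector = LinkCollector()
    collector.feed(html)
    matches = [h for h in collector.hrefs if file_type in h]
    return [urljoin(base_url, h) for h in matches[:limit]]


if __name__ == "__main__":
    # Hypothetical listing page, made up for illustration
    page = ('<a href="../">Parent</a>'
            '<a href="abc-primary.xml.gz">primary</a>'
            '<a name="x">no href</a>'
            '<a href="repomd.xml">repomd</a>')
    base = "https://repo.openeuler.org/openEuler-20.03-LTS/source/repodata/"
    for url in select_links(page, base, "xml", 100):
        print(url)
```

Because the match is a substring test, as in the original `get` function, the file type "xml" also selects compressed files such as `.xml.gz`.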
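The following tool was written to analyze openEuler package dependencies, which are described mainly in the primary.xml document.
It analyzes the XML file with xml.dom.minidom. It first collects all elements whose tagName is package and extracts their name and version information, then, for each package in that list, reads the elements whose tag name is rpm:entry.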

import xml.dom.minidom

# Open the XML document with the minidom parser.
DOMTree = xml.dom.minidom.parse("E:\\store\\523ec3918c22e506e8a58fd5e73c7206dc406820f6cd5276b9e622962a8b79f4-primary"
                                ".xml\\523ec3918c22e506e8a58fd5e73c7206dc406820f6cd5276b9e622962a8b79f4-primary.xml")
collection = DOMTree.documentElement
packages = collection.getElementsByTagName("package")
for p in packages:
    print("*****Package*****")
    name = p.getElementsByTagName('name')[0]
    ver = p.getElementsByTagName('version')[0]
    print("name: %s-%s-%s" % (name.childNodes[0].data, ver.getAttribute('ver'), ver.getAttribute('rel')))
    # Every rpm:entry element under this package is printed as a requirement
    rpm_list = p.getElementsByTagName('rpm:entry')
    for rpm in rpm_list:
        print("req -> %s" % rpm.getAttribute("name"))


<package type="rpm">
  <name>CUnit</name>
  <arch>src</arch>
  <version epoch="0" ver="2.1.3" rel="21.oe1"/>
  <checksum type="sha256" pkgid="YES">2e9456a32627578dd0d265c917cb656595955b28194c13983a7ee7a60a31f264</checksum>
  <summary>A Unit Testing Framework for C</summary>
  <description>CUnit is a lightweight system for writing, administering, and running unit tests in C
It provides C programmers a basic testing functionality with a flexible variety of user
interfaces.
CUnit is built as a static library which is linked with the user's testing code. It
uses a simple framework for building test structures, and provides a rich set of
assertions for testing common data types. In addition, several different interfaces are
provided for running tests and reporting results.</description>
  <packager>http://openeuler.org</packager>
  <url>http://cunit.sourceforge.net/</url>
  <time file="1585233774" build="1584991099"/>
  <size package="523357" installed="516791" archive="517172"/>
  <location href="Packages/CUnit-2.1.3-21.oe1.src.rpm"/>
  <format>
    <rpm:license>LGPLv2+</rpm:license>
    <rpm:vendor>http://openeuler.org</rpm:vendor>
    <rpm:group>Unspecified</rpm:group>
    <rpm:buildhost>obs-worker-004</rpm:buildhost>
    <rpm:sourcerpm/>
    <rpm:header-range start="5096" end="7556"/>
    <rpm:requires>
      <rpm:entry name="libtool"/>
      <rpm:entry name="git"/>
      <rpm:entry name="automake"/>
    </rpm:requires>
  </format>
</package>
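
The sample entry above can also be turned into a dependency map rather than printed. The following is a minimal sketch, not part of the original tool: the function name `collect_requires` is hypothetical, the sample is a shortened copy of the CUnit entry, and it is wrapped in a `<metadata>` root that declares the `rpm` prefix, which is assumed here so that minidom can parse the fragment on its own. Unlike the script above, it only reads `rpm:entry` elements inside `rpm:requires`, so entries from any other list would not be mixed in.

```python
import xml.dom.minidom

# Assumed wrapper: a <metadata> root that declares the rpm namespace prefix.
SAMPLE = """<metadata xmlns="http://linux.duke.edu/metadata/common"
          xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="1">
<package type="rpm">
  <name>CUnit</name>
  <arch>src</arch>
  <version epoch="0" ver="2.1.3" rel="21.oe1"/>
  <format>
    <rpm:requires>
      <rpm:entry name="libtool"/>
      <rpm:entry name="git"/>
      <rpm:entry name="automake"/>
    </rpm:requires>
  </format>
</package>
</metadata>"""


def collect_requires(xml_text):
    """Return {"name-ver-rel": [required package names]} for every package."""
    root = xml.dom.minidom.parseString(xml_text).documentElement
    deps = {}
    for p in root.getElementsByTagName("package"):
        name = p.getElementsByTagName("name")[0].childNodes[0].data
        ver = p.getElementsByTagName("version")[0]
        key = "%s-%s-%s" % (name, ver.getAttribute("ver"), ver.getAttribute("rel"))
        requires = []
        # Only look inside rpm:requires so other rpm:entry lists are not mixed in
        for block in p.getElementsByTagName("rpm:requires"):
            for entry in block.getElementsByTagName("rpm:entry"):
                requires.append(entry.getAttribute("name"))
        deps[key] = requires
    return deps


if __name__ == "__main__":
    for package, requires in collect_requires(SAMPLE).items():
        print(package, "->", ", ".join(requires))
```

A dictionary keyed by name-version-release keeps each package's requirements together, which is easier to feed into later analysis than printed lines.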