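Developer Log: Basic Code

Developer Log 1: Crawler Tool

  • About the crawler tool: a simple tool that downloads files of a given type from a web page.

      The tool uses the BeautifulSoup library to parse the page's HTML, and lets you conveniently specify the page URL, the file type, the number of files to download, and the save location.

      The implementation is very simple, and it can quickly download the XML files that describe openEuler package dependencies.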

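import os
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def getFile(url, file_path):
    """Download the file at url and save it locally; file_path is the full path of the saved file."""
    # Create the file in the target folder, named after the crawled file
    with urlopen(url, timeout=10) as u, open(file_path, 'wb') as f:
        f.write(u.read())
    print(file_path + ' downloaded successfully')


def get(link_url, file_type, num, fold):
    """Download up to num files linked from link_url whose href contains file_type, saving them in fold."""
    soup = BeautifulSoup(urlopen(link_url, timeout=10), 'html.parser')  # parse the page
    count = 0
    for a in soup.find_all('a'):
        if count == num:
            break
        href = a.get("href")
        # Skip anchors without an href; "file_type in None" would raise TypeError
        if href and file_type in href:
            count += 1
            # Resolve the href against the page URL and save under the link's base name
            getFile(urljoin(link_url, href), os.path.join(fold, os.path.basename(href)))


if __name__ == '__main__':
    # URL of the page to crawl
    link = 'https://repo.openeuler.org/openEuler-20.03-LTS/source/repodata/'
    # File type to match
    kind = "xml"
    # Number of files to fetch, counting from the first match
    num = 100
    # Folder to save the files in
    fold = "E:\\store\\"
    get(link, kind, num, fold)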
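
The link-matching step inside get can be tried offline, without touching the network. The following is a minimal sketch, not the tool's own code: it swaps BeautifulSoup for the standard library's html.parser so it runs without third-party packages, and LinkCollector, select_links and the sample listing are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper, not part of the original tool)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # ignore anchors that carry no href
                self.hrefs.append(href)


def select_links(html, base_url, file_type, limit):
    """Return up to `limit` absolute URLs whose href contains `file_type`."""
    collector = LinkCollector()
    collector.feed(html)
    matches = [h for h in collector.hrefs if file_type in h]
    return [urljoin(base_url, h) for h in matches[:limit]]


# Made-up directory listing, used only to exercise the filter
sample_html = """
<a href="../">../</a>
<a href="aaa-primary.xml.gz">aaa-primary.xml.gz</a>
<a href="bbb-filelists.xml.gz">bbb-filelists.xml.gz</a>
<a href="repomd.xml">repomd.xml</a>
<a name="no-href">anchor without href</a>
"""
base = "https://repo.openeuler.org/openEuler-20.03-LTS/source/repodata/"
selected = select_links(sample_html, base, "xml", 2)
print(selected)
```

Note that the match is a plain substring test, so "xml" also selects names such as .xml.gz; a stricter rule (for example href.endswith) may be preferable if only uncompressed files are wanted.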

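Developer Log 2: XML Parsing Tool

  • Written to analyze openEuler package dependencies, which are described mainly in the primary.xml document.

  • The tool analyzes the XML file with xml.dom.minidom. It first collects every element whose tagName is package and extracts its version information, then, for each package in that list, reads the elements tagged rpm:entry.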
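
If the downloaded metadata file is gzip-compressed (repository metadata is commonly published as .xml.gz), it has to be unpacked before minidom can read it. A minimal sketch of that step, assuming a gzip file; decompress_gz and the file names are hypothetical and not part of the original tools.

```python
import gzip
import os
import shutil
import tempfile


def decompress_gz(src_path, dst_path):
    """Decompress the gzip file at src_path into dst_path and return dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Round-trip demo on a temporary file, so the sketch runs without any download
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "primary.xml.gz")  # hypothetical file name
    xml_path = os.path.join(tmp, "primary.xml")    # hypothetical file name
    with gzip.open(gz_path, "wb") as f:
        f.write(b"<metadata/>")
    decompress_gz(gz_path, xml_path)
    with open(xml_path, "rb") as f:
        restored = f.read()
print(restored)
```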

from xml.dom.minidom import parse

# Open the XML document with the minidom parser.
DOMTree = parse("E:\\store\\523ec3918c22e506e8a58fd5e73c7206dc406820f6cd5276b9e622962a8b79f4-primary"
                ".xml\\523ec3918c22e506e8a58fd5e73c7206dc406820f6cd5276b9e622962a8b79f4-primary.xml")
collection = DOMTree.documentElement

# Every <package> element describes one package
packages = collection.getElementsByTagName("package")

for p in packages:
    print("*****Package*****")

    # Package name plus the ver and rel attributes of <version>
    name = p.getElementsByTagName('name')[0]
    ver = p.getElementsByTagName('version')[0]
    print("name: %s-%s-%s" % (name.childNodes[0].data, ver.getAttribute('ver'), ver.getAttribute('rel')))

    # Print the name of every rpm:entry found under this package
    rpm_list = p.getElementsByTagName('rpm:entry')
    for rpm in rpm_list:
        print("req -> %s" % rpm.getAttribute("name"))
  • Part of the XML file structure is shown below.
<package type="rpm">
  <name>CUnit</name>
  <arch>src</arch>
  <version epoch="0" ver="2.1.3" rel="21.oe1"/>
  <checksum type="sha256" pkgid="YES">2e9456a32627578dd0d265c917cb656595955b28194c13983a7ee7a60a31f264</checksum>
  <summary>A Unit Testing Framework for C</summary>
  <description>CUnit is a lightweight system for writing, administering, and running unit tests in C
It provides C programmers a basic testing functionality with a flexible variety of user
interfaces.

CUnit is built as a static library which is linked with the user's testing code.  It
uses a simple framework for building test structures, and provides a rich set of
assertions for testing common data types. In addition, several different interfaces are
provided for running tests and reporting results.</description>
  <packager>http://openeuler.org</packager>
  <url>http://cunit.sourceforge.net/</url>
  <time file="1585233774" build="1584991099"/>
  <size package="523357" installed="516791" archive="517172"/>
  <location href="Packages/CUnit-2.1.3-21.oe1.src.rpm"/>
  <format>
    <rpm:license>LGPLv2+</rpm:license>
    <rpm:vendor>http://openeuler.org</rpm:vendor>
    <rpm:group>Unspecified</rpm:group>
    <rpm:buildhost>obs-worker-004</rpm:buildhost>
    <rpm:sourcerpm/>
    <rpm:header-range start="5096" end="7556"/>
    <rpm:requires>
      <rpm:entry name="libtool"/>
      <rpm:entry name="git"/>
      <rpm:entry name="automake"/>
    </rpm:requires>
  </format>
</package>
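
Given a structure like the one above, the dependencies can also be gathered into a dictionary instead of being printed. This is a minimal sketch under stated assumptions, not the author's tool: collect_requires is a hypothetical helper, and the <metadata> wrapper with its xmlns:rpm declaration is added here only so that the rpm: prefix parses in a standalone snippet.

```python
from xml.dom.minidom import parseString

# Trimmed sample modeled on the structure above. The <metadata> wrapper and the
# xmlns:rpm declaration are assumptions so the rpm: prefix is bound when parsing.
SAMPLE = """<metadata xmlns:rpm="http://linux.duke.edu/metadata/rpm">
<package type="rpm">
  <name>CUnit</name>
  <version epoch="0" ver="2.1.3" rel="21.oe1"/>
  <format>
    <rpm:requires>
      <rpm:entry name="libtool"/>
      <rpm:entry name="git"/>
      <rpm:entry name="automake"/>
    </rpm:requires>
  </format>
</package>
</metadata>"""


def collect_requires(xml_text):
    """Map 'name-ver-rel' to the list of entry names found under rpm:requires."""
    deps = {}
    root = parseString(xml_text).documentElement
    for p in root.getElementsByTagName("package"):
        name = p.getElementsByTagName("name")[0].childNodes[0].data
        ver = p.getElementsByTagName("version")[0]
        key = "%s-%s-%s" % (name, ver.getAttribute("ver"), ver.getAttribute("rel"))
        requires = []
        # Look only inside rpm:requires so entries from other sections are not mixed in
        for block in p.getElementsByTagName("rpm:requires"):
            for entry in block.getElementsByTagName("rpm:entry"):
                requires.append(entry.getAttribute("name"))
        deps[key] = requires
    return deps


deps = collect_requires(SAMPLE)
print(deps)
```

Restricting the search to rpm:requires is a design choice of this sketch: a broader getElementsByTagName('rpm:entry') on the package would also pick up entries from any other section that uses the same tag.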
