Scraping project info from GitHub with Python's BeautifulSoup module

1: I've been looking at web scraping in Python for a while now, and I wanted to see how other people write their code. So I searched GitHub for the keyword pachong (pinyin for 爬虫, "crawler") and browsed the projects people had uploaded (naming things in pinyin really is a helpful habit, heh =.=).

2: Then I noticed that a lot of these projects scrape "campus beauty" sites, photo-gallery sites and the like. Naturally I will not admit that I just wanted to learn a bit more about those sites myself.

3: So I decided to write a crawler to collect every website that appears in the projects across those hundred pages of search results, and take a look at them.

# -*- coding: utf-8 -*-
import re
import sys
import urllib2

from bs4 import BeautifulSoup

# Python 2 hack so writing the scraped text to a file does not raise
# UnicodeDecodeError; neither needed nor possible on Python 3.
reload(sys)
sys.setdefaultencoding("utf-8")


def crawl_page(url, end_http, num):
    """Scrape one page of GitHub code-search results, then recurse into the next page."""
    num = num + 1
    print("============================== page {}".format(num))
    header = {
        "Accept": "",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Connection": "keep-alive",
        "Cookie": "",
        "Host": "github.com",
        "If-None-Match": "",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
        "content-type": "charset=utf8",
    }

    req = urllib2.Request(url=url, headers=header)
    res = urllib2.urlopen(req).read()

    soup = BeautifulSoup(res, "html.parser")

    # Each search hit sits in a div with this class; the second <a> inside it
    # links to the matching file.
    a_all = soup.find_all(name="div", attrs={
        "class": "flex-auto min-width-0 col-10"
    })
    url_list = []

    for i in a_all:
        # The href is already root-relative ("/user/repo/..."), so don't add
        # another slash after the host.
        url_list.append("https://github.com{}".format(i.a.next_sibling.next_sibling["href"]))

    for url in url_list:
        req = urllib2.Request(url=url, headers=header)
        res = urllib2.urlopen(req).read()

        soup0 = BeautifulSoup(res, "html.parser")
        spans = soup0.find_all(name="span")

        # Pull every URL-looking token (http or https) out of the page's <span> text.
        re_text = re.compile(r"https?://\S+")
        for span in spans:
            for match in re_text.findall(span.get_text()):
                end_http.append(match)

    # Follow the "next" pagination link until there is none left.
    try:
        next_a = soup.find("a", attrs={"rel": "next"})
        next_url = "https://github.com{}".format(next_a["href"])
        print(next_url)
        crawl_page(next_url, end_http, num)
    except (TypeError, KeyError):
        # soup.find returned None: there is no next page, so we are done.
        print("no next page found")


if __name__ == '__main__':
    num = 0
    url = "https://github.com/search?l=Python&p=1&q=pa%27chong&type=Code"
    end_http = []
    crawl_page(url, end_http, num)

    with open("/Users/mac/PycharmProjects/0000/URL_Request/http_url", "a") as f:
        for u in end_http:
            f.write("{}\n".format(u))

4: The output while running looks roughly like this:

[Image: sample console output while the crawler runs]

5: Since I didn't really have a specific goal, I simply grabbed every http URL on each page. You could add your own filtering rules, or run one more regex pass over the collected results to clean the data.
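That second cleaning pass could look something like this: a minimal sketch that strips each collected URL down to scheme://host and drops duplicates. The `clean_urls` helper and the sample messy lines are made up for illustration, not part of the crawler above.

```python
import re

def clean_urls(raw_urls):
    """Reduce each collected URL to scheme://host and drop duplicates, keeping order."""
    # Capture the scheme and everything up to the first slash, whitespace,
    # quote, or angle bracket (the kinds of junk the spans drag along).
    host_re = re.compile(r"(https?://[^/\s\"'<>]+)")
    seen = set()
    cleaned = []
    for raw in raw_urls:
        match = host_re.match(raw)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            cleaned.append(match.group(1))
    return cleaned

# Hypothetical examples of the messy lines the crawler collects:
raw = [
    "http://www.example.com/page.html\")",
    "http://www.example.com/other",
    "https://img.example.org/1.jpg<br>",
]
print(clean_urls(raw))  # → ['http://www.example.com', 'https://img.example.org']
```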

6: The collected data looks roughly like this:

[Image: sample of the collected URL data]

A lot of the entries look messy, but I don't plan to polish this any further. Mainly I just wanted to write down what I built; I've done this before but forgot to record it, heh.

It's rough, no doubt about it. With no real requirements, I just wrote whatever features came to mind,

so this is for learning purposes only. Experts, please skip past.
