Scraping GitHub commits information with Python

Crawl GitHub for users with more than 1200 commits and the distribution of their commits

    • Introduction
    • Preparation
    • Scraping a user's profile page
    • Getting commits information
    • Printing qualifying users' commits from the last week
    • Anti-scraping issues
    • Summary

Introduction

A while ago I helped a classmate with a web-scraping assignment, a fairly basic one, so here is a short write-up. The task is to crawl GitHub for users with more than 1200 commits and the distribution of their commits. The complete code can be downloaded here: https://pan.baidu.com/s/1BUE5WqrCQ4PDEsS8aBX7Vg

Preparation

  1. Python 3.6
  2. The requests and BeautifulSoup libraries (if you are not familiar with them, see the Chinese documentation for requests and BeautifulSoup); a quick sanity check is shown right after this list.
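A minimal check that both libraries are installed and can parse a page (using the starting user "u2" from the next section as the example URL):

import requests
from bs4 import BeautifulSoup

req = requests.get("https://github.com/u2")
soup = BeautifulSoup(req.text, "lxml")
print(soup.title.text)    #should print the title of the profile page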

Scraping a user's profile page

Looking at the URLs, you can see that every GitHub profile page has the form "https://github.com/" + user ID, and the user's sub-pages have the form "https://github.com/" + user ID + "?tab=…". So we can start from a single user and, by following the Followers/Following relationships, reach a large number of users. I take a BFS (breadth-first search) approach here: starting from the user with ID "u2", I crawl the Following lists level by level and append the names to a candidate user list.
The code is as follows:

import random
import time

import requests
from bs4 import BeautifulSoup

#Collect the user names from the current user's Following page, append them to the
#candidate user list, and return the new tail index of the queue
#Parameters: the URL of the user's Following page and the current tail index
def get_user_following(userFollowingUrl, userListTail):
    req = requests.get(userFollowingUrl)
    html = req.text
    soup = BeautifulSoup(html,"lxml")
    userFollowings = soup.find_all('span', class_ = 'f4 link-gray-dark')
    for element in userFollowings:
        userName = element.text.replace('\xa0'*8,'\n\n')
        if userName != "":
            userList.append(userName)          #userList is the global BFS queue
            userListTail = userListTail + 1
    return userListTail
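The queue bookkeeping used above and in the next section (userList, userListHead, userListTail), plus the result file and counters, are initialized elsewhere in the full script. A minimal sketch of a plausible setup, assuming the crawl starts from "u2" and qualifying names go into a hypothetical users.txt:

#Hypothetical initialization, not shown in the original snippets
userList = ["u2"]                     #BFS queue, seeded with the starting user
userListHead = 0                      #index of the next user to process
userListTail = 1                      #index one past the last queued user
users = []                            #qualifying users found so far
userDataNum = 0                       #how many qualifying users have been found
usersFile = open("users.txt", 'w')    #output file for qualifying user names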

Getting commits information

Looking at the URL pattern again, each user's commits overview for a given month is at "https://github.com/" + user name + "?tab=overview&from=" + start date + "&to=" + end date. Using this pattern, we walk through the last seven years, twelve months per year, for every candidate user and check whether their total number of commits exceeds 1200.
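The startDate and endDate tables used in the loop below are defined elsewhere in the full script; a minimal sketch of what they might look like, assuming they map a month number (1-12) to the "-MM-DD" suffix of that month's first and last day (February simply padded to the 28th):

#Hypothetical month-to-date-suffix tables (index 0 unused)
startDate = ["", "-01-01", "-02-01", "-03-01", "-04-01", "-05-01", "-06-01",
             "-07-01", "-08-01", "-09-01", "-10-01", "-11-01", "-12-01"]
endDate = ["", "-01-31", "-02-28", "-03-31", "-04-30", "-05-31", "-06-30",
           "-07-31", "-08-31", "-09-30", "-10-31", "-11-30", "-12-31"]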
The code is as follows:

while userListHead < userListTail:
    if userDataNum >= 20:
        break
    userName = userList[userListHead]
    userListHead = userListHead + 1
    userFollowingUrl = "https://github.com/"+userName+"?tab=following"
    userListTail = get_user_following(userFollowingUrl, userListTail)
    time.sleep(int(random.uniform(2,4)))
    commitsNum = 0

    print("Got the user's following list successfully")

    #Walk through the user's commits records month by month for the last seven years
    for year in ["2018","2017","2016","2015","2014","2013","2012"]:
        for month in range(12,0,-1):
            eachMonthCommitsUrl = "https://github.com/"+userName+"?tab=overview&from="+year+startDate[month]+"&to="+year+endDate[month]
            time.sleep(int(random.uniform(2,4)))
            mainReq = requests.get(eachMonthCommitsUrl)
            mainHtml = mainReq.text
            mainSoup = BeautifulSoup(mainHtml,"lxml")
            monthProjectCommits = mainSoup.find_all('a', class_ = 'f6 muted-link ml-1')
            #Count this month's commits
            for element in monthProjectCommits:
                curNum = 0
                curCommitsNum = str(element.text.replace('\xa0'*8,'\n\n'))
                for letter in curCommitsNum:
                    if letter >= "0" and letter <= "9":
                        curNum = curNum*10 + int(letter)
                commitsNum = commitsNum + curNum
            #Simple pruning: stop once the threshold is reached, or once it is clear the
            #user has had no commits at all (none in 2018 and none in the 2017 months checked so far)
            if (commitsNum >= 1200) or (year == "2017" and commitsNum == 0):
                break
        if (commitsNum >= 1200) or (year == "2017" and commitsNum == 0):
            break

    print("commitsNum is " + str(commitsNum))
    #Store a user who meets the requirement
    if commitsNum >= 1200:
        usersFile.write(userName+"\n")
        users.append(userName)
        userDataNum = userDataNum + 1
usersFile.close()
print("Found 20 users!")

Printing qualifying users' commits from the last week

For each qualifying user, i.e. a user with more than 1200 commits, we crawl their commits records for the most recent week.
The code is as follows:

#Print each qualifying user's commits for the most recent week (i.e. 2018.12.23-2018.12.29)
def print_user_commits(userFileName):
    usersFile = open(userFileName, 'r')
    users = usersFile.readlines()
    usersFile.close()
    for i in range(0, len(users)):
        users[i] = users[i].rstrip('\n')
    userDataNum = 0
    for userName in users:
        print("Begin to print " + userName + "'s commits")
        fileName = "TXT/User" + str(userDataNum) + ".txt"
        myFile = open(fileName, 'w')
        myFile.write(userName+"\n")
        lastWeekCommitsUrl = "https://github.com/"+userName+"?tab=overview&from=2018-12-23&to=2018-12-29"
        req = requests.get(lastWeekCommitsUrl)
        html = req.text
        soup = BeautifulSoup(html,"lxml")
        lastWeekProjectCommits = soup.find_all('a', class_ = 'f6 muted-link ml-1')
        cnt = 0
        for commits in lastWeekProjectCommits:
            #Get the project name: keep everything in the href before the "?",
            #then drop the trailing "commits", which leaves "/user/repo/"
            href = str(commits.get('href'))
            projectName = ""
            for letter in href:
                if letter != "?":
                    projectName = projectName + letter
                else:
                    break
            projectName = projectName[:len(projectName)-7]

            #Get this project's commits for the week
            detailedCommitUrl = "https://github.com"+projectName+"commits?author="+userName+"&since=2018-12-23&until=2018-12-30"
            time.sleep(int(random.uniform(2,4)))
            subReq = requests.get(detailedCommitUrl)
            subHtml = subReq.text
            subSoup = BeautifulSoup(subHtml,"lxml")
            relativeTime = subSoup.find_all('relative-time')
            #Record the timestamp of every commit
            for element in relativeTime:
                datetime = str(element.get('datetime'))
                onePiece = str(cnt)+" "+projectName+" "+datetime
                print(onePiece)
                myFile.write(onePiece + '\n')
                cnt = cnt + 1
        myFile.close()
        userDataNum = userDataNum + 1
        print("Printed " + userName + "'s commits successfully!\n")

Anti-scraping issues

GitHub has anti-scraping measures, and crawling a few hundred pages in a row is usually enough to trigger them. I took the simplest workaround: add some delay between requests and use a proxy IP pool so that each request goes out from a random IP. The result was not great, though; free proxy IPs have an extremely low success rate and introduce huge delays. A paid proxy service or some other workaround would be better, but I am leaving that for later.
The code for the proxy IP pool is as follows:

#Check whether this proxy IP address actually works
def is_ok(socket):
    header = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64)'
    }
    proxies = {
        'http': socket,
        'https': socket,
    }
    try:
        #the timeout keeps a dead proxy from hanging the check indefinitely
        req = requests.get('http://httpbin.org/ip', headers=header, proxies=proxies, timeout=5)
        print('finish')
        print(req.text)
        return True
    except requests.exceptions.RequestException:
        print('no proxies')
        return False
        

#Scrape proxy IP addresses from a free Chinese proxy site
#20 pages, 300 proxy IP addresses in total
def get_ip_pool():
    ipFile = open('IP.txt', 'w')
    for page in range(1,21,1):
        url = "https://www.kuaidaili.com/free/inha/" + str(page) + "/"
        req = requests.get(url)
        html = req.text
        soup = BeautifulSoup(html, 'lxml')
        allTd = soup.find_all('td')
        socket = ""
        for td in allTd:
            data = td.get('data-title')
            if data == "IP":
                socket = td.text.replace('\xa0'*8,'\n\n')
            if data == "PORT":
                socket = socket + ":" + td.text.replace('\xa0'*8,'\n\n')
                ipFile.write(socket + '\n')
    ipFile.close()


#Return a random request header (headers)
def get_headers():
    user_agent_list = [ \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", \
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    UserAgent=random.choice(user_agent_list)
    headers = {'User-Agent': UserAgent}
    return headers


#Return a random proxy IP address
def get_proxy():
    proxyFile = open('IP.txt', 'r')
    proxyList = proxyFile.readlines()
    for i in range(0, len(proxyList)):
        proxyList[i] = proxyList[i].rstrip('\n')
    proxy = random.choice(proxyList)
    #print(proxyList)
    proxies = {
        'http': proxy,
        'https': proxy,
    }
    proxyFile.close()
    return proxies

#To use the pool, replace the requests.get() calls above with a retry loop like this
#(the timeout keeps a dead proxy from blocking the request forever):
while True:
    try:
        mainReq = requests.get(eachMonthCommitsUrl, headers=get_headers(), proxies=get_proxy(), timeout=10)
        break    #stop retrying once a request succeeds
    except requests.exceptions.RequestException:
        continue
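The is_ok() check defined earlier is never actually wired in above; a minimal sketch (not in the original) of how the pool could be pre-filtered and the retries capped, so a run does not loop forever on dead proxies:

#Hypothetical helpers combining the pieces above

#Rewrite IP.txt, keeping only the proxies that pass is_ok()
def filter_ip_pool():
    with open('IP.txt', 'r') as proxyFile:
        proxyList = [line.rstrip('\n') for line in proxyFile]
    goodProxies = [p for p in proxyList if is_ok(p)]
    with open('IP.txt', 'w') as proxyFile:
        for p in goodProxies:
            proxyFile.write(p + '\n')

#Fetch a URL through a random proxy, retrying at most maxTries times,
#then fall back to a direct request
def get_with_retry(url, maxTries=10):
    for attempt in range(maxTries):
        try:
            return requests.get(url, headers=get_headers(), proxies=get_proxy(), timeout=10)
        except requests.exceptions.RequestException:
            continue
    return requests.get(url, headers=get_headers(), timeout=10)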

Summary

That is roughly it. The main difficulty is handling the anti-scraping measures, which I will leave for later study; this is as far as I got for now. If you are interested, feel free to download the complete source code and take a look, ugly as it is.
