[Information Gathering] URL Deduplication Script

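After information gathering you end up with a big pile of URLs, and many of them point at the same domain. So I wrote a script that deduplicates them by subdomain, and named it Readytogo.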

Usage is simple, and it runs fast.

python readytogo.py -f url.txt

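You can adjust the configuration to fit your situation. The number of leading characters of the subdomain (counted after the http:// or https:// prefix is removed) that are compared during deduplication is adjustable and defaults to seven. You can also edit the script so that it deduplicates on the tail of each URL (off by default).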
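
The adjustable comparison length can be sketched as a single function that takes the length as a parameter. This is only an illustration under assumptions, not the script's own code: `dedupe_by_affix` and its `length` and `from_end` parameters are hypothetical names introduced here, and the example hosts are made up.

```python
def dedupe_by_affix(hosts, length=7, from_end=False):
    """Keep the first entry for each distinct prefix (or suffix) of `length` characters."""
    if length < 1:
        raise ValueError("length must be at least 1")
    seen = set()   # prefixes/suffixes already encountered
    kept = []      # first entry seen for each distinct key, in input order
    for host in hosts:
        key = host[-length:] if from_end else host[:length]
        if key not in seen:
            seen.add(key)
            kept.append(host)
    return kept


# Hypothetical sample data, with the scheme already removed.
hosts = ["www.example.com/a", "www.example.com/b", "api.example.com/a"]

# Compare the first 7 characters: both "www.exa..." entries collapse into one.
print(dedupe_by_affix(hosts))
# Compare the last 5 characters instead: the two ".../a" entries collapse into one.
print(dedupe_by_affix(hosts, length=5, from_end=True))
```

Note that a short prefix is a coarse key: with seven characters, `www.example.com` and `www.example.org` would be treated as duplicates, so the length is worth tuning against the actual list.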

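The code is pretty rough (sob), but I typed every line of it myself, so it is worth commemorating. Until now I had only been imitating the scripts of more experienced people, so from here on I need to strengthen my coding skills!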

# readytogo
import argparse
parser = argparse.ArgumentParser(description= "Are you ready?")
parser.add_argument('-f',type = str,required = True)
args = parser.parse_args()
## Deduplicate by the head of each entry
def showreal(mylist):
    seen = set()  # prefixes already encountered
    mynewlist = []
    for i in mylist:
        ihead = i[0:7]  ## compare the first 7 characters of the subdomain
        if ihead not in seen:
            seen.add(ihead)
            mynewlist.append(i)  # keep only the first entry per prefix
    return mynewlist
    
    
## Deduplicate by the tail of each entry
def showreal2(mylist):
    seen = set()  # suffixes already encountered
    mynewlist = []
    for i in mylist:
        itill = i[-5:]  ## 5 is adjustable; this compares the last 5 characters
        #print(itill)
        if itill not in seen:
            seen.add(itill)
            mynewlist.append(i)  # keep only the first entry per suffix
    return mynewlist
    
    
def main():
    list1 = []
    list2 = []
    with open(args.f, "r") as file:  
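        for i in file:
            if not i.strip():
                continue  # skip blank lines
            if i.startswith('http://'):
                # slice the prefix off; str.strip() would remove a set of characters
                i = i[len('http://'):]
                list1.append(i)
            elif i.startswith('https://'):
                i = i[len('https://'):]
                list2.append(i)
            else:
                list1.append(i)  # no scheme: treated as http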
    list1 = showreal(list1)
    list2 = showreal(list2)
    list3 = []
    list4 = []
    for x in list1:
        x = "http://" + x 
        list3.append(x)
    for z in list2:
        z = "https://" + z
        list3.append(z)
    for i in list3:
        i = i.replace("\n",'')
        list4.append(i)
        print(i)
    ##list4 = showreal2(list4)  # tail deduplication is off by default
    
    
    ## Write the result back to the input file (this overwrites it)
    with open(args.f, "w") as f:
        for i in list4:
            if i != '':
                f.write(i + "\n")
    print("\nfinished")

if __name__ == "__main__":
    main()
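
The script removes the http:// or https:// prefix before comparing and puts it back afterwards. As a sketch of that step in isolation: `str.strip` takes a set of characters rather than a prefix, so cutting the scheme off by length is the safer approach. `split_scheme` is a hypothetical helper name, not part of the script above, and the URL is a made-up example.

```python
def split_scheme(url):
    """Return (scheme, rest); scheme is '' when the URL has no http(s) prefix."""
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            return scheme, url[len(scheme):]
    return "", url


url = "https://shop.example.com"

# Slicing removes exactly the prefix.
print(split_scheme(url))        # ('https://', 'shop.example.com')

# strip() removes any of the characters h, t, p, s, :, / from both ends,
# so it also eats the leading "sh" of the host name.
print(url.strip("https://"))    # op.example.com
```

Keeping the scheme alongside the remainder also means it can be reattached unchanged after deduplication.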
