A Simple Python Crawler Program

# Introduction

Every time a paper is rejected and has to be resubmitted, I need to look up recent conferences related to it. Whenever that happens, my usual approach is to walk through the conference-tracking site www.myhuiban.com and inspect the conferences one by one, watching three things: the submission deadline, the CCF rank, and the conference's scope.

## TODO: add a screenshot of the site to show more clearly why this need arises

It occurred to me that I could write a simple Python crawler to download the whole conference list and build a local index over all conferences, which would allow fully customized searches.

Between yesterday afternoon and noon today I wrote roughly 400 lines of Python that basically meet the need. The main problems along the way were simulating the site login, regular-expression matching, JSON encoding and decoding, file reading and writing, and command-line argument handling.

# Overall structure

Cookie module: simulates the user login and saves the cookie so later runs can reuse it.
Crawler module: crawls all conferences on the first page, parses out the URL of the next page, then crawls the remaining pages one by one; for every conference it follows the link to fetch the detail page, and stores the results in an in-memory dictionary and in a file.
JSON module: converts the conference dictionary to a string saved in a file, and rebuilds the conference list from that file.
Search module: parses the command-line arguments and walks the conferences doing pattern matching. A sketch of the overall flow follows.
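
A minimal sketch of that flow, using the classes defined below:

    # CookieMgr supplies the login cookie, UrlCrawler builds or loads the
    # conference dictionary, and SearchEngine walks it with a pattern.
    engine = SearchEngine()   # crawls the site, or loads conferences.txt if cached
    engine.searchConf(date='2016-0[6-8]', ccf='[ab]', name=None, key='cloud', showdetail=False)
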
# Modules in detail

## Cookie module

The target site only shows the conference list to logged-in users, so the crawler has to simulate the login by hand and save the cookie; later visits can then present the cookie directly instead of logging in again.

A code fragment that simulates the login and saves the cookie:

    def requestCookie(self):
        # Create a MozillaCookieJar instance to hold the cookie; it is saved to a file afterwards
        self.cookie = cookielib.MozillaCookieJar(self.filename)
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie))
        # Note: a Python dict cannot carry the duplicate 'LoginForm[rememberMe]'
        # field the browser posts; only the last value ('1') is actually sent.
        postdata = urllib.urlencode({
                    'LoginForm[email]':'xxxxxxx',
                    'LoginForm[password]':'xxxxxx',
                    'LoginForm[rememberMe]':'1',
                    'yt0':'登录'  # label of the login submit button
                })
        loginUrl = 'http://www.myhuiban.com/login'
        # Simulate the login; the cookie is captured into the jar
        result = opener.open(loginUrl, postdata)
        # Save the cookie to cookie.txt
        self.cookie.save(ignore_discard=True, ignore_expires=True)

The code above first creates the MozillaCookieJar object, then builds an opener around it and uses the opener to open the connection, sending the username and password to the server in a POST request.

A practical tip: you may need to crawl some other site that requires a login, and the POST payload it expects will differ. How to handle that? First, use the browser's built-in developer tools to see exactly what is POSTed during a login. Second, construct postdata by hand and fire the login request with the Python code above. Third, compare the cookie the browser receives with the cookie your simulated login receives; a quick check follows.
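
A minimal sketch of that third step; the cookie names are just the ones this particular site happens to set (copied from the developer tools, see directCookie in the full listing):

    browser_names = set(['PHPSESSID', 'cb813690bb90b0461edd205fc53b6b1c'])
    jar = CookieMgr().getCookie()  # jar produced by the simulated login
    print 'missing cookies:', browser_names - set(c.name for c in jar)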

Loading the cookie directly from the file:

    def loadCookie(self):
        self.cookie = cookielib.MozillaCookieJar()
        try:
            self.cookie.load(self.filename, ignore_discard=True, ignore_expires=True)
        except IOError:    # no cached cookie yet; getCookie falls back to the network
            self.cookie = None
            return
        for item in self.cookie:
            print 'LoadCookie: Name = ' + item.name
            print 'LoadCookie: Value = ' + item.value

Printing the entries makes it easy to check that they match what was stored.

Combining the file cache with the network request

The code below first tries to load the cookie from the file; only if that fails does it request one over the network, saving it to the file afterwards. This amounts to a small cache, so the cookie does not have to be fetched over the network on every run.

    def getCookie(self):
        if(self.cookie == None):
            self.loadCookie()
        if(self.cookie == None):
            self.requestCookie()
        return self.cookie

## Conference object

All conferences are stored as conference objects, so one has to be defined. Based on the attributes a conference has on the web page, it looks like this:

    class Conference:
        # Note: for JSON decoding (see MyDecoder) the constructor's parameter names
        # must match the attribute names exactly; the defaults let createConf start
        # from an empty object.
        def __init__(self, ccf=None, core=None, qualis=None, simple=None,
                     traditinal=None, delay=None, sub=None, note=None, conf=None,
                     place=None, num=None, browser=None, content=None, rate=None,
                     id=-1):
            self.ccf = ccf
            self.core = core
            self.qualis = qualis
            self.simple = simple
            self.traditinal = traditinal
            self.delay = delay
            self.sub = sub
            self.note = note
            self.conf = conf
            self.place = place
            self.num = num
            self.browser = browser
            self.content = content
            self.id = id
            self.rate = rate

        def setContent(self, content):
            self.content = content
        def setAcceptRate(self, rate):
            self.rate = rate
        def printConf(self):
            print 'CCF CORE QUALIS ShortN Long Delay SubmissionDt Notification Conference Place NumberedHold Nread'
            print self.ccf, self.core, self.qualis, self.simple, self.traditinal, self.delay, self.sub, self.note, self.conf, self.place, self.num, self.browser
        def printDetailConf(self):
            self.printConf()
            print('Content:')
            print self.content
        def isValid(self):
            return self.id != -1  # id keeps its -1 default until createConf fills it in
        def json(self):
            pass
        @staticmethod
        def createConf(line):
            # A row of the list, with tags stripped, reads like:
            # u'c b 2 IDEAL International Conference on Intelligent Data Engineering and Automated Learning 2016-05-15 2016-06-15 2016-10-12 Wroclaw, Poland 17 3932'
            # Columns: CCF  CORE  QUALIS  short name  full name  delayed  deadline  notification  conference date  place  session  views
            # The <td>-based pattern below is a reconstruction (assumption): the
            # exact tag layout depends on the site's HTML at crawl time.
            pattern = (r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>'      # ccf, core, qualis
                       r'<td><a href="(/conference/\d+)">(.*?)</a></td>'  # id, short name
                       r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>'      # full name, delayed, deadline
                       r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>'      # notification, conference date, place
                       r'<td>(.*?)</td><td>(.*?)</td>')                   # session number, views
            info = re.findall(pattern, line, re.S)[0]

            conf = Conference()
            conf.ccf = info[0]
            conf.core = info[1]
            conf.qualis = info[2]
            conf.id = info[3]
            conf.simple = info[4]
            conf.traditinal = info[5]
            conf.delay = info[6]
            conf.sub = info[7]
            conf.note = info[8]
            conf.conf = info[9]
            conf.place = info[10]
            conf.num = info[11]
            conf.browser = info[12]
            #conf.printConf()
            return conf

The conference object takes a fragment of HTML, parses it, and fills in the attributes above.
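
For example, assuming the <td>-based row layout reconstructed in createConf above, a hand-written row parses like this:

    # Hypothetical row in the assumed layout (values from the sample comment above)
    row = ('<td>c</td><td>b</td><td>2</td>'
           '<td><a href="/conference/1733">IDEAL</a></td>'
           '<td>International Conference on Intelligent Data Engineering and Automated Learning</td>'
           '<td></td><td>2016-05-15</td><td>2016-06-15</td><td>2016-10-12</td>'
           '<td>Wroclaw, Poland</td><td>17</td><td>3932</td>')
    conf = Conference.createConf(row)
    conf.printConf()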

## JSON conversion

Since the program should not have to re-crawl the whole conference list on every run, the conference objects must be serialized to a file as strings. JSON is one of the usual ways to serialize objects.

The object/JSON conversion code below is borrowed from the web; I did not write it. Source: http://www.cnblogs.com/coser/archive/2011/12/14/2287739.html

    class MyEncoder(json.JSONEncoder):
        def default(self,obj):
            #convert object to a dict
            d = {}
            d['__class__'] = obj.__class__.__name__
            d['__module__'] = obj.__module__
            d.update(obj.__dict__)
            return d

    class MyDecoder(json.JSONDecoder):
        def __init__(self):
            json.JSONDecoder.__init__(self,object_hook=self.dict2object)
        def dict2object(self,d):
            #convert dict to object
            if '__class__' in d:
                class_name = d.pop('__class__')
                module_name = d.pop('__module__')
                module = __import__(module_name)
                class_ = getattr(module,class_name)
                args = dict((key.encode('ascii'), value) for key, value in d.items()) #get args
                #pdb.set_trace()
                inst = class_(**args) #create new instance
            else:
                inst = d
            return inst

These two small classes are all that is needed to convert between objects and strings.
For example, given an object p, d = MyEncoder().encode(p) encodes p as a JSON string, and o = MyDecoder().decode(d) decodes that string back into an object with the same content.
The thing to watch in the dict2object decoding hook is inst = class_(**args): it builds the object by calling the constructor of class_, here Conference.__init__. Since args is a dictionary whose keys are the attribute names of Conference, the constructor's parameter names must correspond one-to-one with the attribute names.
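
A round-trip sketch (it relies on the default arguments added to Conference.__init__ above):

    conf = Conference(ccf='b', simple='IDEAL', id='/conference/1733')
    s = MyEncoder().encode(conf)    # '{"__class__": "Conference", "__module__": ...}'
    conf2 = MyDecoder().decode(s)   # a new Conference built via __init__(**args)
    conf2.printConf()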

## Crawler module

The logic here is the most involved: download the HTML of the conference index page and parse out the conference list; for each conference in the list, download its detail page; when the current page's conferences are all done, start on the next page, until every page has been processed. Once all pages are finished, storeConf saves the conferences to a file so that the next run can read them straight from disk.

The code:

    class UrlCrawler:
        def __init__(self):
            self.rootUrl = "http://www.myhuiban.com"
            self.mgr = CookieMgr()
            self.curPage = 1
            self.endPage = -1
            self.confDic = {}
        def requestUrl(self, url):
            r = requests.get(url, cookies=self.mgr.getCookie(), headers=self.mgr.getHeader())
            r.encoding = 'utf-8'  # set the encoding
            return r
        def setRoot(self, url):
            self.rootUrl = url
        def firstPage(self, url):
            self.pageHtml = self.requestUrl(url).text
            self.endPage = self.searchEndPage()
            self.procAPage()
        def hasNext(self):
            return self.curPage <= self.endPage
        def nextPage(self):
            #http://www.myhuiban.com/conferences?Conference_page=14
            self.pageHtml = self.requestUrl(self.rootUrl+"/conferences?Conference_page="+str(self.curPage)).text
            self.procAPage()
        def procAPage(self):
            # ... parse pageHtml into table rows (full code in the complete listing below)
            for i in range(len(table)):
                conf = Conference.createConf(table[i])
                self.parseConfContent(conf)
                self.confDic[conf.simple] = conf
            self.curPage += 1
        def parseConfContent(self, conf):
            pass  # ... fetch and parse the conference detail page (full code below)
        def requestConf(self):
            # load the first page, then iterate over the rest
            self.firstPage(self.rootUrl+"/conferences")
            #print self.confDic
            while(self.hasNext()):
                self.nextPage()
        def crawlWeb(self):
            if(len(self.confDic) == 0):
                self.loadConf()
            if(len(self.confDic) == 0):
                self.requestConf()
            self.storeConf()

        def storeConf(self):
            f = file('conferences.txt', 'w')
            confJson = json.dumps(self.confDic, cls=MyEncoder)
            f.write(confJson)
            f.close()
        def loadConf(self):
            try:
                f = file('conferences.txt', 'r')
            except IOError:
                return  # no cache yet; crawlWeb falls back to requestConf
            confJson = f.read()
            self.confDic = MyDecoder().decode(confJson)
            f.close()

        def searchEndPage(self):
            # Reconstructed pattern (assumption): the pager's "last" link holds
            # the number of the final page.
            pattern = r'<li class="last"><a href="/conferences\?Conference_page=(\d+)">'
            #print self.pageHtml
            return int(re.search(pattern, self.pageHtml).group(1))

## Command-line options module

A program needs a good external interface so that it is convenient to use. This one therefore uses getopt-style command-line arguments to support several kinds of queries.

The main code follows:

1. The main function builds the query options from the user's input.
2. The SearchEngine class hands the query options to each conference object and calls the match method to decide whether the conference satisfies them; matching conferences are printed.
    class SearchEngine:
        def __init__(self):
            self.crawler = UrlCrawler()
            self.crawler.crawlWeb()
        #date='6-8', ccf='a|b|c', conf='FAST',  key='xyz', showdetail=True
        def searchConf(self, date='2016-0[6-8]', ccf='b', name='CIDR', key='cloud', showdetail=True):
            pat = OnePattern(date, ccf, name, key, showdetail)
            matNum = 0
            for _, value in self.crawler.confDic.items():  # '_' avoids shadowing the key argument
                if(pat.match(value)):
                    matNum += 1
            print 'Total Matched Conference Num: ', matNum
    
    if __name__ == '__main__':
        opts, args = getopt.getopt(sys.argv[1:], "hd:c:n:k:s:", ['help', 'date=', 'ccf=', 'name=',  'key=', 'showDetail='])
        date = None
        ccf = None
        name = None
        key = None
        showdetail = False
        for op, value in opts:
            if op in ("-d", '--date'):
                date = value    
            elif op in ("-c", '--ccf'):
                ccf = value
            elif op in ("-n", '--name'):
                name = value
            elif op in ("-k", "--key"):
                key = value
            elif op in ("-s", "-showDetail"):
                showdetail = value
            elif op in ("-h", "--help"):
                print 'Usage:', sys.argv[0], 'options'
                print '    Mandatory arguments to long options are mandatory for short options too.'
                print '    -d, --date=DATE  search the given date, default none.'
                print '    -c, --ccf=[abc]  search the given CCF level, default none.'
                print '    -n, --name=NAME  search the given conference name (both short and long names are supported).'
                print '    -k, --key=KEYWORD  search the given KEYWORD in conference description.'
                print '    -s, --showDetail=1  print the detailed description of each match.'
                print '    -h, --help  print this help info.'
                print 'Example:'
                print '    ', sys.argv[0], '-d 2016-0[6-8] -c b -n CIDR -k cloud -s 1'
                print '        ', 'search the conferences whose submission date is from 2016-06 to 2016-08 and whose level is CCF B, with keyword cloud, and show the detail information.'
                print '    ', sys.argv[0], '--name=CIDR'
                print '    ', sys.argv[0], '-n CIDR'
                print '        ', 'search the conferences whose name contains CIDR.'
                print '    ', sys.argv[0], '-n ^CIDR$ -s 1'
                print '        ', 'search the conference whose name is exactly CIDR and print the detail info.'
                print '    ', sys.argv[0], '-c [ab] -k "cloud computing"'
                print '        ', 'search the conferences that are CCF A or B and are about cloud computing.'
                print '    ', sys.argv[0], '-k "(cloud computing)|(distributed systems)"'
                print '        ', 'search the conferences that are about cloud computing or distributed systems.'
                print '    ', sys.argv[0], '-k "cloud | computing | distributed"'
                print '        ', 'search the conferences that contain cloud, computing, or distributed.'
                print '    ', sys.argv[0], '-k "(Cloud migration service).*(cloud computing)"'
                print '        ', 'search the conferences that contain both phrases, in that order.'
                print 
                sys.exit()
        SearchEngine().searchConf(date, ccf, name, key, showdetail)
        #print date, ccf, name, key, showdetail
        sys.exit(0)
    

# Full crawler code

Here is the complete code. It took about a day and has not been rigorously tested, so bugs certainly remain; I can only fix them gradually as I find them in use.

The techniques it uses, summarized:

1. Simulating the site login
2. Saving the cookie so the next run can reuse it, reducing requests
3. Issuing network requests and parsing the returned HTML with regular expressions
4. Converting between objects and JSON with custom encoder/decoder hooks
5. Saving objects to a file and loading them directly next time, avoiding a fresh crawl on every run
6. Long and short command-line options, covering the usual cases

Coding tricks:

1. getCookie tries the file first and only falls back to the network on failure.
2. UrlCrawler's firstPage and nextPage methods form an iterator-like interface, which keeps the crawl loop clear.

Possible improvements:

1. Store the objects in a database instead of a file, and query them with SQL.
2. A richer Pattern abstraction: I wanted to support expressions like (a|b) & (c|d) but did not want to complicate the code, so I left it out; the sketch after this list shows how far the existing classes already go.
3. A query GUI. Python GUI toolkits (Tk and the like; I have not studied them closely) require a local Xming server on my setup, which annoys me, so I won't pursue it. The GUI I want is one where I can hand the finished code to the Python interpreter and see a window, with no extra server dependency.
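
For the second point, the existing classes can already be nested to express (a|b) & (c|d), since OnePattern, OrPattern, and AndPattern all expose the same match(conf) method. A sketch (note that OnePattern prints each conference it matches, so a combined query prints once per matching sub-pattern):

    cloud  = OnePattern(None, None, None, 'cloud', False)        # key ~ cloud
    dist   = OnePattern(None, None, None, 'distributed', False)  # key ~ distributed
    ccf_ab = OnePattern(None, '[ab]', None, None, False)         # CCF rank A or B
    pat = AndPattern(OrPattern(cloud, dist), ccf_ab)             # (cloud|distributed) & [ab]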

One more disclaimer: I have not studied Python in depth; I have only skimmed a basic Python book (just the core syntax and the built-in data structures), so I may not have fully absorbed Python's idioms. Some parts of the code below may therefore look unidiomatic (they don't feel that way to me); after all, my habits come from C++.

    #!/usr/bin/python
    # -*- coding:utf-8 -*-
    
    import urllib
    import urllib2
    import cookielib
    import requests
    import sys
    import os
    import pdb
    import re
    import json
    import getopt
    #pdb.set_trace()
    
    class CookieMgr:
        def __init__(self):
            self.cookie = None
            self.filename = 'cookie.txt'
        def getCookie(self):
            if(self.cookie == None):
                self.loadCookie()
            if(self.cookie == None):
                self.requestCookie()
            return self.cookie
        def loadCookie(self):
            self.cookie = cookielib.MozillaCookieJar()
            try:
                self.cookie.load(self.filename, ignore_discard=True, ignore_expires=True)
            except IOError:    # no cached cookie yet; getCookie falls back to the network
                self.cookie = None
                return
            for item in self.cookie:
                print 'LoadCookie: Name = ' + item.name
                print 'LoadCookie: Value = ' + item.value
        def requestCookie(self):
            # Create a MozillaCookieJar instance to hold the cookie; it is saved to a file afterwards
            self.cookie = cookielib.MozillaCookieJar(self.filename)
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie))
            # Note: a Python dict cannot carry the duplicate 'LoginForm[rememberMe]'
            # field the browser posts; only the last value ('1') is actually sent.
            postdata = urllib.urlencode({
                        'LoginForm[email]':'xxxxxxx',
                        'LoginForm[password]':'xxxxxx',
                        'LoginForm[rememberMe]':'1',
                        'yt0':'登录'  # label of the login submit button
                    })
            loginUrl = 'http://www.myhuiban.com/login'
            # Simulate the login; the cookie is captured into the jar
            result = opener.open(loginUrl, postdata)
            # Save the cookie to cookie.txt
            self.cookie.save(ignore_discard=True, ignore_expires=True)
            for item in self.cookie:
                print 'ReqCookie: Name = ' + item.name
                print 'ReqCookie: Value = ' + item.value
        def badCode(self):
            # The server returns 500 internal error for this, so it is unused; kept only as a tip
            login_url = 'http://www.myhuiban.com/conferences'
            # Reuse the saved cookie to open another URL that requires the login
            #gradeUrl = 'http://www.myhuiban.com/conferences'
            #result = opener.open(gradeUrl)
            #print result.read()
        def test(self):
            mgr = CookieMgr()
            cookie = mgr.getCookie() 
        def directCookie(self):
            # On the login page http://www.myhuiban.com/login, trigger a login and use the browser's
            # built-in developer tools to inspect the request and copy the cookie it carries
            cookies = {
                'PHPSESSID':'3ac588c9f271065eb3f1066bfb74f4e9',
                'cb813690bb90b0461edd205fc53b6b1c':'9b40de79863acaa3df499703611cdb1e123b15c9a%3A4%3A%7Bi%3A0%3Bs%3A14%3A%22lpstudy%40qq.com%22%3Bi%3A1%3Bs%3A14%3A%22lpstudy%40qq.com%22%3Bi%3A2%3Bi%3A2592000%3Bi%3A3%3Ba%3A3%3A%7Bs%3A2%3A%22id%22%3Bs%3A4%3A%221086%22%3Bs%3A4%3A%22name%22%3Bs%3A8%3A%22Lp+Study%22%3Bs%3A4%3A%22type%22%3Bs%3A1%3A%22n%22%3B%7D%7D',
                '__utmt':'1',
                '__utma':'201260338.796552597.1428908352.1463018481.1463024893.21',
                '__utmb':'201260338.15.10.1463024893',
                '__utmc':'201260338',
                '__utmz':'201260338.1461551356.19.9.utmcsr=baidu|utmccn=(organic)|utmcmd=organic',
            }
            return cookies
        def getHeader(self):
            headers = {
                'Host': 'www.myhuiban.com',
                'Connection': 'keep-alive',
                'Cache-Control': 'max-age=0',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
                'Accept-Encoding': 'gzip, deflate, sdch',
                'Accept-Language': 'zh-CN,zh;q=0.8',
            }
            return headers
           
    
    confer_url = 'http://www.myhuiban.com/conferences'
    
    
    # The conference object; one instance per conference.
    # Each row of the conference list has this structure:
    #CCF  CORE  QUALIS  short name  full name  delayed  deadline  notification  conference date  place  session  views
    # Each conference detail page (e.g. http://www.myhuiban.com/conference/1733) adds:
    # Call for papers: the conference description
    # Acceptance rate: statistics for past years
    class Conference:
        # Note: for JSON decoding (see MyDecoder) the constructor's parameter names
        # must match the attribute names exactly; the defaults let createConf start
        # from an empty object.
        def __init__(self, ccf=None, core=None, qualis=None, simple=None,
                     traditinal=None, delay=None, sub=None, note=None, conf=None,
                     place=None, num=None, browser=None, content=None, rate=None,
                     id=-1):
            self.ccf = ccf
            self.core = core
            self.qualis = qualis
            self.simple = simple
            self.traditinal = traditinal
            self.delay = delay
            self.sub = sub
            self.note = note
            self.conf = conf
            self.place = place
            self.num = num
            self.browser = browser
            self.content = content
            self.id = id
            self.rate = rate

        def setContent(self, content):
            self.content = content
        def setAcceptRate(self, rate):
            self.rate = rate
        def printConf(self):
            print 'CCF CORE QUALIS ShortN Long Delay SubmissionDt Notification Conference Place NumberedHold Nread'
            print self.ccf, self.core, self.qualis, self.simple, self.traditinal, self.delay, self.sub, self.note, self.conf, self.place, self.num, self.browser
        def printDetailConf(self):
            self.printConf()
            print('Content:')
            print self.content
        def isValid(self):
            return self.id != -1  # id keeps its -1 default until createConf fills it in
        def json(self):
            pass
        @staticmethod
        def createConf(line):
            # A row of the list, with tags stripped, reads like:
            # u'c b 2 IDEAL International Conference on Intelligent Data Engineering and Automated Learning 2016-05-15 2016-06-15 2016-10-12 Wroclaw, Poland 17 3932'
            # Columns: CCF  CORE  QUALIS  short name  full name  delayed  deadline  notification  conference date  place  session  views
            # The <td>-based pattern below is a reconstruction (assumption): the
            # exact tag layout depends on the site's HTML at crawl time.
            pattern = (r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>'      # ccf, core, qualis
                       r'<td><a href="(/conference/\d+)">(.*?)</a></td>'  # id, short name
                       r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>'      # full name, delayed, deadline
                       r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>'      # notification, conference date, place
                       r'<td>(.*?)</td><td>(.*?)</td>')                   # session number, views
            info = re.findall(pattern, line, re.S)[0]

            conf = Conference()
            conf.ccf = info[0]
            conf.core = info[1]
            conf.qualis = info[2]
            conf.id = info[3]
            conf.simple = info[4]
            conf.traditinal = info[5]
            conf.delay = info[6]
            conf.sub = info[7]
            conf.note = info[8]
            conf.conf = info[9]
            conf.place = info[10]
            conf.num = info[11]
            conf.browser = info[12]
            #conf.printConf()
            return conf
    class MyEncoder(json.JSONEncoder):
        def default(self,obj):
            #convert object to a dict
            d = {}
            d['__class__'] = obj.__class__.__name__
            d['__module__'] = obj.__module__
            d.update(obj.__dict__)
            return d
     
    class MyDecoder(json.JSONDecoder):
        def __init__(self):
            json.JSONDecoder.__init__(self,object_hook=self.dict2object)
        def dict2object(self,d):
            #convert dict to object
            if '__class__' in d:
                class_name = d.pop('__class__')
                module_name = d.pop('__module__')
                module = __import__(module_name)
                class_ = getattr(module,class_name)
                args = dict((key.encode('ascii'), value) for key, value in d.items()) #get args
                #pdb.set_trace()
                inst = class_(**args) #create new instance
            else:
                inst = d
            return inst
            
    
        
    class UrlCrawler:
        def __init__(self):
            self.rootUrl = "http://www.myhuiban.com"
            self.mgr = CookieMgr()
            self.curPage = 1
            self.endPage = -1
            self.confDic = {}
        def requestUrl(self, url):
            r = requests.get(url, cookies=self.mgr.getCookie(), headers=self.mgr.getHeader())
            r.encoding = 'utf-8'  # set the encoding
            return r
        def setRoot(self, url):
            self.rootUrl = url
        def firstPage(self, url):
            self.pageHtml = self.requestUrl(url).text
            self.endPage = self.searchEndPage()
            self.procAPage()
        def hasNext(self):
            #pdb.set_trace()
            return self.curPage <= self.endPage
        def nextPage(self):
            #http://www.myhuiban.com/conferences?Conference_page=14
            self.pageHtml = self.requestUrl(self.rootUrl+"/conferences?Conference_page="+str(self.curPage)).text
            self.procAPage()
        def procAPage(self):
            print 'process page: ', self.curPage, ' end:', self.endPage
            # Reconstructed patterns (assumption): grab the <tbody> of the listing
            # table, strip class attributes and span tags, then split into one
            # row per line for createConf.
            pattern = r'<tbody>\n(.*?)\n</tbody>'
            #print self.pageHtml
            table = re.findall(pattern, self.pageHtml, re.S)[0]
            #pdb.set_trace()
            re_class = re.compile(r' class=".*?"')
            table = re_class.sub('', table)  # strip class attributes
            re_style = re.compile('<span.*?>', re.I)  # non-greedy, case-insensitive
            table = re_style.sub('', table)
            re_style = re.compile('</span>', re.I)  # strip the closing span tags
            table = re_style.sub('', table)
            
            table = table.split('\n')
            for i in range(len(table)):
                #print table[i]
                conf = Conference.createConf(table[i])
                self.parseConfContent(conf)
                self.confDic[conf.simple] = conf
            self.curPage += 1
        def parseConfContent(self, conf):
            html = self.requestUrl(self.rootUrl + conf.id).text
            #print html
            # Reconstructed pattern (assumption): the call-for-papers text sits
            # in a content <div> on the detail page.
            pattern = r'<div class="content">\n(.*?)\n</div>'
            found = re.findall(pattern, html, re.S)
            if found:
                conf.setContent(found[0])
        def requestConf(self):
            # load the first page, then iterate over the rest
            self.firstPage(self.rootUrl+"/conferences")
            while(self.hasNext()):
                self.nextPage()
        def crawlWeb(self):
            if(len(self.confDic) == 0):
                self.loadConf()
            if(len(self.confDic) == 0):
                self.requestConf()
            self.storeConf()
        def storeConf(self):
            f = file('conferences.txt', 'w')
            confJson = json.dumps(self.confDic, cls=MyEncoder)
            f.write(confJson)
            f.close()
        def loadConf(self):
            try:
                f = file('conferences.txt', 'r')
            except IOError:
                return  # no cache yet; crawlWeb falls back to requestConf
            confJson = f.read()
            self.confDic = MyDecoder().decode(confJson)
            f.close()
        def searchEndPage(self):
            # Reconstructed pattern (assumption): the pager's "last" link holds
            # the number of the final page.
            pattern = r'<li class="last"><a href="/conferences\?Conference_page=(\d+)">'
            #print self.pageHtml
            return int(re.search(pattern, self.pageHtml).group(1))

    class Pattern:
        def __init__(self):
            self.patList = list()
        def match(self, conf):
            return True
        def addPat(self, pat):
            self.patList.append(pat)

    class OnePattern:
        def __init__(self, date, ccf, conf, key, showdetail):
            self.date = date
            self.ccf = ccf
            self.conf = conf  # name
            self.key = key
            self.showdetail = showdetail
        def match(self, conf):
            res = True
            if(self.date):
                res = re.findall(self.date, conf.sub, re.S)
            if(self.conf):
                res = (re.findall(self.conf, conf.simple, re.S) or re.findall(self.conf, conf.traditinal, re.S)) and res
            if(self.ccf):
                #print conf.ccf, self.ccf, re.findall(self.ccf, conf.ccf, re.S)
                res = re.findall(self.ccf, conf.ccf, re.S) and res
            if(self.key):
                res = re.findall(self.key, conf.content, re.S) and res
            if(res):
                if(not self.showdetail):
                    conf.printConf()
                else:
                    conf.printDetailConf()
                return True
            else:
                return False

    class TwoOpPattern(Pattern):
        def __init__(self, a, b):
            Pattern.__init__(self)
            self.patList.append(a)
            self.patList.append(b)

    class OrPattern(TwoOpPattern):
        def match(self, conf):
            return self.patList[0].match(conf) or self.patList[1].match(conf)

    class AndPattern(TwoOpPattern):
        def match(self, conf):
            return self.patList[0].match(conf) and self.patList[1].match(conf)

    class NotPattern(Pattern):
        def match(self, conf):
            return not self.patList[0].match(conf)

    class SearchEngine:
        def __init__(self):
            self.crawler = UrlCrawler()
            self.crawler.crawlWeb()
        #date='6-8', ccf='a|b|c', conf='FAST',  key='xyz', showdetail=True
        def searchConf(self, date='2016-0[6-8]', ccf='b', name='CIDR', key='cloud', showdetail=True):
            pat = OnePattern(date, ccf, name, key, showdetail)
            matNum = 0
            for _, value in self.crawler.confDic.items():  # '_' avoids shadowing the key argument
                if(pat.match(value)):
                    matNum += 1
            print 'Total Matched Conference Num: ', matNum

    if __name__ == '__main__':
        opts, args = getopt.getopt(sys.argv[1:], "hd:c:n:k:s:", ['help', 'date=', 'ccf=', 'name=', 'key=', 'showDetail='])
        date = None
        ccf = None
        name = None
        key = None
        showdetail = False
        for op, value in opts:
            if op in ("-d", '--date'):
                date = value
            elif op in ("-c", '--ccf'):
                ccf = value
            elif op in ("-n", '--name'):
                name = value
            elif op in ("-k", "--key"):
                key = value
            elif op in ("-s", "--showDetail"):
                showdetail = value
            elif op in ("-h", "--help"):
                print 'Usage:', sys.argv[0], 'options'
                print '    Mandatory arguments to long options are mandatory for short options too.'
                print '    -d, --date=DATE  search the given date, default none.'
                print '    -c, --ccf=[abc]  search the given CCF level, default none.'
                print '    -n, --name=NAME  search the given conference name (both short and long names are supported).'
                print '    -k, --key=KEYWORD  search the given KEYWORD in conference description.'
                print '    -s, --showDetail=1  print the detailed description of each match.'
                print '    -h, --help  print this help info.'
                print 'Example:'
                print '    ', sys.argv[0], '-d 2016-0[6-8] -c b -n CIDR -k cloud -s 1'
                print '        ', 'search the conferences whose submission date is from 2016-06 to 2016-08 and whose level is CCF B, with keyword cloud, and show the detail information.'
                print '    ', sys.argv[0], '--name=CIDR'
                print '    ', sys.argv[0], '-n CIDR'
                print '        ', 'search the conferences whose name contains CIDR.'
                print '    ', sys.argv[0], '-n ^CIDR$ -s 1'
                print '        ', 'search the conference whose name is exactly CIDR and print the detail info.'
                print '    ', sys.argv[0], '-c [ab] -k "cloud computing"'
                print '        ', 'search the conferences that are CCF A or B and are about cloud computing.'
                print '    ', sys.argv[0], '-k "(cloud computing)|(distributed systems)"'
                print '        ', 'search the conferences that are about cloud computing or distributed systems.'
                print '    ', sys.argv[0], '-k "cloud | computing | distributed"'
                print '        ', 'search the conferences that contain cloud, computing, or distributed.'
                print '    ', sys.argv[0], '-k "(Cloud migration service).*(cloud computing)"'
                print '        ', 'search the conferences that contain both phrases, in that order.'
                print
                sys.exit()
        SearchEngine().searchConf(date, ccf, name, key, showdetail)
        #print date, ccf, name, key, showdetail
        sys.exit(0)