SGMLParseError: unexpected ',' char in declaration. Fix: add a downloader middleware to Scrapy

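While writing a crawler with Scrapy, I ran into this error:
SGMLParseError: unexpected ',' char in declaration
I checked my code and found nothing wrong, so I searched on Bing. There was nothing in Chinese, only this English-language thread: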
http://computer-programming-forum.com/56-python/8a28baf87155c619.htm

I'm guessing there's a comment in your HTML files that is spelled like this:
I'm not an expert in SGML, but I do know that it has an oft misunderstood definition of a comment. I think the above is a valid comment, but SGMLlib (of course) doesn't parse it right, resulting in your error.
The other possibility is your file has one of those silly things at the top that no one knows what the hell it is. Maybe it erroneously (in the opinion of SGMLlib) has a colon in it.

-- CARL BANKS

Following that hint, I looked at the function that raises the error and found:

    # Internal -- parse declaration (for use by subclasses).
    def parse_declaration(self, i):
        # This is some sort of declaration; in "HTML as
        # deployed," this should only be the document type
        # declaration ("<!DOCTYPE html...>").
        # ISO 8879:1986, however, has more complex
        # declaration syntax for elements in <!...>, including:
        # --comment--
        # [marked section]
        # name in the following list: ENTITY, DOCTYPE, ELEMENT,
        # ATTLIST, NOTATION, SHORTREF, USEMAP,
        # LINKTYPE, LINK, IDLINK, USELINK, SYSTEM
        # ...
        self.error("unexpected %r char in declaration" % rawdata[j])
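
To make the failure mode concrete, here is a simplified re-creation of the scanning idea behind parse_declaration. It is only a sketch, not the library's code: the function name first_unexpected_char is made up for illustration, and comments, marked sections and internal subsets are deliberately left out.

```python
import string

# Characters that may start / continue a name such as DOCTYPE or html.
NAME_START = string.ascii_letters
NAME_CHARS = string.ascii_letters + string.digits + "-_."


def first_unexpected_char(declaration):
    """Return the first character this simplified scanner rejects, or None.

    Hypothetical helper: it walks a "<!...>" declaration, skipping names,
    quoted literals and whitespace, and reports anything else it meets.
    """
    j = 2  # skip the leading "<!"
    n = len(declaration)
    while j < n:
        c = declaration[j]
        if c == ">":
            return None  # reached the end of the declaration cleanly
        if c in "\"'":
            end = declaration.find(c, j + 1)
            if end < 0:
                return None  # unterminated literal, out of scope here
            j = end + 1
        elif c in NAME_START:
            while j < n and declaration[j] in NAME_CHARS:
                j += 1
        elif c.isspace():
            j += 1
        elif c == "[":
            return None  # internal subsets are not modelled in this sketch
        else:
            return c  # e.g. an unquoted ',' or ':'
    return None
```

Under these assumptions, a well-formed DOCTYPE passes (the colon in its URL sits inside a quoted literal), while a declaration with a bare comma or colon is rejected at that character.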

Next I checked the fetched page, and sure enough, there was one:

猎豹浏览器点我》》 ("Cheetah Browser, click me >>")

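So I decided to clean it up. Scrapy parses this page itself and then uses the rules to decide whether to hand it to my code, so the page has to be fixed as soon as Scrapy downloads it, by deleting the illegal string.
The final solution: add a downloader middleware and do the cleanup in process_response.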

# In settings.py, add the following to register the custom downloader middleware

DOWNLOADER_MIDDLEWARES = {
    'discountSpider.middlewares.CustomDownloaderMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
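
The scrapy.contrib module path above comes from older Scrapy releases. As far as I know, from Scrapy 1.0 onward the built-in middlewares live under scrapy.downloadermiddlewares, so the equivalent setting would look roughly like this (same project and class names as above; check it against the Scrapy version you run):

```python
# settings.py sketch for newer Scrapy releases (1.0+)
DOWNLOADER_MIDDLEWARES = {
    # The custom middleware described in this post
    'discountSpider.middlewares.CustomDownloaderMiddleware': 400,
    # Setting the value to None disables the built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```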

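# Code in middlewares.py
# Implement process_response: once the downloader has finished fetching a page, the response passes through this method, where the page content can be modified
class CustomDownloaderMiddleware(object):
    # Before the downloader sends the request, set a browser-like User-Agent in place of the crawler's default so it is less likely to be identified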
    def process_request(self,request,spider):
        ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        request.headers.setdefault('User-Agent', ua)

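    # After the downloader has finished fetching the document, some pages need extra processing here
    def process_response(self,request, response, spider):
        print('process_response:'+response.url)
        # Some list pages on xxx.com contain strings whose DOCTYPE holds illegal characters such as ',' and ':', which make the SGML parser fail and raise an exception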
        if response.url.find('xxx.com/page')>=0 or response.url.find('xxx.com/1-0-a-0-0/p')>=0:
            #lang="
            newbody = response.body.replace('lang="','lang="')
            response = response.replace(body=newbody)

        # A response must be returned here
        return response
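
The middleware above removes one known string from specific URLs. A more general variant, sketched below, drops any "<!NAME ...>" declaration that has a comma or colon outside its quoted parts. This is my own assumption about what counts as "bad", not something the original code does. The names strip_bad_declarations and DeclarationCleanupMiddleware are hypothetical, and the body is treated as bytes, as it is under Python 3.

```python
import re

# One "<!NAME ...>" declaration; quoted literals are matched as opaque units.
DECLARATION = re.compile(rb'<![A-Za-z](?:"[^"]*"|\'[^\']*\'|[^<>"\'])*>')
# A single- or double-quoted literal inside a declaration.
QUOTED = re.compile(rb'"[^"]*"|\'[^\']*\'')


def strip_bad_declarations(body):
    """Remove declarations with ',' or ':' outside quotes (assumed rule)."""
    def _replace(match):
        declaration = match.group(0)
        unquoted = QUOTED.sub(b"", declaration)
        if b"," in unquoted or b":" in unquoted:
            return b""  # drop the offending declaration entirely
        return declaration  # keep well-formed declarations untouched
    return DECLARATION.sub(_replace, body)


class DeclarationCleanupMiddleware(object):
    """Hypothetical downloader middleware applying the cleanup to each page."""

    def process_response(self, request, response, spider):
        cleaned = strip_bad_declarations(response.body)
        if cleaned != response.body:
            # Response.replace returns a copy with the given fields changed
            response = response.replace(body=cleaned)
        return response
```

The trade-off is that every response is scanned, not just the known-bad list pages, so the URL check from the original middleware may still be worth keeping on large crawls. HTML comments are left alone because the pattern requires a letter right after "<!".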
