A Collection of Web Scraper Errors

Error 1: UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 41: illegal multibyte sequence

While scraping job data from Lagou (拉勾网), the following error appeared after part of the data had already been fetched:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 41: illegal multibyte sequence

The cause turned out to be the step that writes the scraped data to the CSV file. The original source code:

    def csv_writer(self,position):
        title = ['job_name','company','salary','city','education','workyear','job_des']
        with open('lagou.csv','a',newline='') as f:  # no encoding given, so Windows defaults to gbk
            writer = csv.DictWriter(f,title)
            writer.writerow(position)

Analysis: the error is raised during the file write. On Windows, open() falls back to the platform default encoding, gbk, which cannot encode some Unicode characters (here the non-breaking space '\xa0'). The fix is to specify the encoding explicitly:

    def csv_writer(self,position):
        title = ['job_name','company','salary','city','education','workyear','job_des']
        with open('lagou.csv','a',newline='', encoding='utf-8') as f:  # write as UTF-8 instead of the gbk default
            writer = csv.DictWriter(f,title)
            writer.writerow(position)
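
For reference, here is a self-contained sketch of the same write under the fix. The row dict below is made up for illustration (the scraper's real `position` has the same keys), and this version also writes the header row once, which the method above does not do:

    import csv

    title = ['job_name','company','salary','city','education','workyear','job_des']
    # Hypothetical row shaped like the scraper's `position` dict; note the '\xa0'
    # that gbk cannot encode but utf-8 handles without trouble.
    position = {'job_name': 'Python Engineer', 'company': 'SomeCo', 'salary': '15k-25k',
                'city': 'Beijing', 'education': 'Bachelor', 'workyear': '3 years',
                'job_des': 'crawl\xa0and clean job data'}

    with open('lagou.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, title)
        if f.tell() == 0:           # new/empty file: write the header once
            writer.writeheader()
        writer.writerow(position)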

Reference: https://blog.csdn.net/github_35160620/article/details/53353672

Addendum:
If you just want to force the write and ignore the problem, you can keep the default encoding and pass errors="ignore"; any character that cannot be encoded is then silently dropped from the output.
Source of the idea: https://blog.csdn.net/yanjiaxin1996/article/details/80113552

    def csv_writer(self,position):
        title = ['job_name','company','salary','city','education','workyear','job_des']
        with open('lagou.csv','a',newline='', errors="ignore") as f:  # unencodable characters are silently discarded
            writer = csv.DictWriter(f,title)
            writer.writerow(position)

Error 2: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 12-13: truncated \xXX escape

This error appeared when opening a file by its Windows path. The cause is that backslashes in an ordinary string literal start escape sequences: '\x' begins a hex escape, so the '\xici_proxy' part of the path (positions 12-13) is rejected. Either escape the backslashes or, as below, use a raw string so they are taken literally:

    # raw string (r'...'): backslashes are taken literally, not as escape characters
    with open(r'C:\Py\spider\xici_proxy\ip.csv','r') as f:
        print(f.read())
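
For reference, a short sketch of the equivalent ways to spell the same path; any of them avoids the truncated-escape error:

    # Three equivalent ways to write the same Windows path
    p1 = r'C:\Py\spider\xici_proxy\ip.csv'     # raw string: backslashes taken literally
    p2 = 'C:\\Py\\spider\\xici_proxy\\ip.csv'  # escape every backslash
    p3 = 'C:/Py/spider/xici_proxy/ip.csv'      # forward slashes also work on Windows
    print(p1 == p2)  # True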

Error 3: SyntaxError: invalid character in identifier

The cause is an invalid character somewhere in the code. Check the reported line carefully for full-width (Chinese) punctuation that looks almost identical to its ASCII counterpart, such as a full-width space or a full-width equals sign.

    # The line that raised the error looked normal, but a full-width character
    # (e.g. a Chinese full-width space) had been typed in place of the ASCII one.
    proxy = {
        'http': proxies[random.randint(0, len(proxies) - 1)]
    }
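
To locate the hidden character, one option is to scan the suspect line for anything outside ASCII. A minimal sketch (the `line` string below is a made-up example that hides a full-width space, U+3000):

    # Print every non-ASCII character in a suspect line of source, with its
    # position and Unicode code point.
    line = "proxy = {'http':\u3000proxies[0]}"  # hypothetical line with a full-width space
    for i, ch in enumerate(line):
        if ord(ch) > 127:
            print(f"position {i}: {ch!r} (U+{ord(ch):04X})")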
