Scraping free proxies returns an empty list?!

Python web scraping overview

A while back I followed the steps from some posts to scrape proxies, and now it returns an empty list?! It's only been a month or two; how can it have stopped working?

Then I stumbled across a post saying, roughly, that these free proxy sites have set up their own IP anti-scraping measures!!! Seriously? What a trap!

After a few more tries I was sure my code itself wasn't at fault, so the next step was to inspect what the request was actually getting back!
Without further ado, here is the fixed code:

'''
    Scrape the free proxies from kuaidaili
'''
# import modules
import requests
from lxml import etree
from fake_useragent import UserAgent

ua = UserAgent().random  # random User-Agent to dodge the site's anti-scraping check
# fetch the page source
url = 'https://www.kuaidaili.com/free/inha/1/'
headers = {'User-Agent': ua}
res = requests.get(url, headers=headers).text
# extract the IPs and ports
html = etree.HTML(res)
'''
    Watch out here!!!
    When parsing the page source, nothing could be extracted. How come?
    Looking at the page source in the browser again -- the DOM seems fine.
    But comparing it with what was actually fetched: the browser's source and
    the fetched HTML are NOT the same!!!
    So write the XPath against the HTML that was actually fetched, and the
    values we want can be found.
'''
HTTPs = html.xpath('//tbody/tr/td[4]/text()')   # protocol column
IPs = html.xpath('//tbody/tr/td[1]/text()')     # IP column
Ports = html.xpath('//tbody/tr/td[2]/text()')   # port column
proxies = []  # renamed from `list` to avoid shadowing the built-in
for HTTP, IP, Port in zip(HTTPs, IPs, Ports):
    proxies.append(HTTP + ':' + IP + ':' + Port)
print(proxies)
for i in proxies:
    print(i)

Here is what the scrape returned:

# The output is as follows
['HTTP:60.167.132.220:9999', 'HTTP:36.248.132.85:9999', 'HTTP:39.106.223.134:80', 'HTTP:110.243.25.158:9999', 'HTTP:120.83.104.228:9999', 'HTTP:36.248.129.130:9999', 'HTTP:220.249.149.25:9999', 'HTTP:171.35.160.221:9999', 'HTTP:123.101.231.54:9999', 'HTTP:182.46.110.254:9999', 'HTTP:163.204.240.202:9999', 'HTTP:113.195.17.123:9999', 'HTTP:115.218.5.222:9000', 'HTTP:125.108.126.217:9000', 'HTTP:110.243.5.254:9999']
HTTP:60.167.132.220:9999
HTTP:36.248.132.85:9999
HTTP:39.106.223.134:80
HTTP:110.243.25.158:9999
HTTP:120.83.104.228:9999
HTTP:36.248.129.130:9999
HTTP:220.249.149.25:9999
HTTP:171.35.160.221:9999
HTTP:123.101.231.54:9999
HTTP:182.46.110.254:9999
HTTP:163.204.240.202:9999
HTTP:113.195.17.123:9999
HTTP:115.218.5.222:9000
HTTP:125.108.126.217:9000
HTTP:110.243.5.254:9999

Once you've scraped them, you can go straight ahead and set these up as your proxies!
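As a quick illustration, here is a minimal sketch of plugging one scraped entry into requests. It assumes the 'PROTOCOL:ip:port' string format produced above; the check_proxy helper and the http://httpbin.org/ip test URL are my own choices for the example, not part of the original script.

# Minimal sketch: try a scraped 'PROTOCOL:ip:port' entry as a requests proxy.
# check_proxy and the httpbin.org test URL are illustrative, not from the post.
import requests

def check_proxy(entry, timeout=5):
    protocol, ip, port = entry.split(':')
    proxy_url = protocol.lower() + '://' + ip + ':' + port
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        resp = requests.get('http://httpbin.org/ip',
                            proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

print(check_proxy('HTTP:60.167.132.220:9999'))  # True if this proxy still works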

This is the code for scraping the proxies on page one, and it tested successfully!
If you want to scrape the next n pages as well, just change the page parameter in the URL and wrap it in a loop, as sketched below.
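For reference, here is a rough sketch of such a loop. It assumes the page number really is just the trailing segment of the URL, as in /free/inha/1/ above; n_pages and all_proxies are illustrative names.

# Rough sketch: crawl the first n pages by looping over the page number in the URL.
# n_pages and all_proxies are illustrative; the XPaths are the same as above.
import time
import requests
from lxml import etree
from fake_useragent import UserAgent

headers = {'User-Agent': UserAgent().random}
n_pages = 5
all_proxies = []
for page in range(1, n_pages + 1):
    url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page)
    res = requests.get(url, headers=headers).text
    html = etree.HTML(res)
    HTTPs = html.xpath('//tbody/tr/td[4]/text()')
    IPs = html.xpath('//tbody/tr/td[1]/text()')
    Ports = html.xpath('//tbody/tr/td[2]/text()')
    for HTTP, IP, Port in zip(HTTPs, IPs, Ports):
        all_proxies.append(HTTP + ':' + IP + ':' + Port)
    time.sleep(1)  # be polite between requests
print(all_proxies)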
