Notes on a Web Scraping Example

Simulate a person clicking through the browser, then scrape the data.
Main Python packages used: selenium, pandas

Imports used, from selenium plus the standard library and pandas:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

Open Chrome through webdriver and load the page with get

driver = webdriver.Chrome(options=chrome_options)
driver.get(MAIN_PAGE_URL)
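
The `chrome_options` object passed to `webdriver.Chrome` above has to exist before the driver is created. As a rough sketch, the command-line switches that these notes set further down (a custom user agent and a proxy) could be assembled by a small helper. `build_chrome_args` and the sample values are hypothetical names for illustration, not part of the original script.

```python
def build_chrome_args(user_agent=None, proxy_host=None, proxy_port=None):
    """Return the Chrome command-line switches for the given settings.

    Hypothetical helper: it only builds strings, so it can be checked
    without launching a browser.
    """
    args = []
    if user_agent:
        args.append("--user-agent=" + user_agent)
    if proxy_host and proxy_port:
        args.append("--proxy-server=http://{}:{}".format(proxy_host, proxy_port))
    return args


# Example values (assumptions, replace with your own).
chrome_args = build_chrome_args(
    user_agent="Mozilla/5.0 (X11; Linux x86_64)",
    proxy_host="127.0.0.1",
    proxy_port="8888",
)
print(chrome_args)

# With selenium installed, the switches would be applied like this:
#   chrome_options = Options()
#   for arg in chrome_args:
#       chrome_options.add_argument(arg)
#   driver = webdriver.Chrome(options=chrome_options)
```

Keeping the switches in a plain list makes it easy to log or compare the exact configuration used for each run.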

Logging in

def login(driver):
    # Locate the "continue with email" button by XPath
    email_btn = driver.find_element(By.XPATH, XPATH_email_btn)
    # Click the button
    email_btn.click()
    # A forced sleep is required here; otherwise the site detects the scripted operation
    time.sleep(3)

    # Enter the user's email address
    email_input = driver.find_element(By.NAME, "user[email]")
    email_input.send_keys(USER_EMAIL)

    # Find the login form and submit it to send the email
    login_form = driver.find_element(By.XPATH, XPATH_login_form)
    login_form.submit()
    # TODO: the next form does not show right after submitting (might need to sleep)
    time.sleep(17)

    # Find the password input, enter the password, and submit
    pw_input = driver.find_element(By.NAME, "user[password]")
    pw_input.send_keys(PASSWORD)
    pw_input.submit()
    time.sleep(2)
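
The fixed `time.sleep(17)` above waits the full 17 seconds even when the password field appears sooner, and it fails if the page takes longer. One possible alternative is to poll for the element until it shows up. The sketch below is a generic polling helper, not code from the original script; `wait_for`, its parameters, and the stand-in `fake_find` are assumptions for illustration. Selenium also ships its own `WebDriverWait` class for this purpose.

```python
import time


def wait_for(find, timeout=20.0, poll=0.5):
    """Call find() repeatedly until it returns without raising.

    find    -- zero-argument callable, e.g. a lambda wrapping find_element
    timeout -- maximum number of seconds to keep trying
    poll    -- pause between attempts, in seconds
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except Exception:
            # Element not there yet: give up once the deadline has passed.
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)


# Stand-in for driver.find_element that fails twice before succeeding,
# so the sketch runs without a browser.
attempts = {"count": 0}


def fake_find():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not present yet")
    return "password-field"


element = wait_for(fake_find, timeout=5.0, poll=0.01)
print(element, attempts["count"])
```

In the login function this might be called as `wait_for(lambda: driver.find_element(By.NAME, "user[password]"))`. The short sleep after the first click is a separate matter: the notes keep it deliberately to avoid detection, so polling would only replace the long wait for the next form.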

Scraping the data

The key step: get page_source

html_source = driver.page_source

Then split page_source repeatedly until only the desired data is left.

Example:

First create an empty dictionary

df_dict = {
    "confirmation": [],
    "status": [],
}

Split page_source

all_entries = html_source.split("Booking details")[1].split('18px;">')[1:]
confirmation = all_entries[4].split("…

status = html_source.split("…")[1].split(">")[1].split("<")[0]
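
Chained `split` calls of this kind amount to taking the text between a start marker and an end marker. A minimal sketch of that idea follows; the `between` helper and the sample HTML are made up for illustration and do not reflect the real page's markup.

```python
def between(text, start, end):
    """Return the part of text after the first `start` and before the next `end`."""
    return text.split(start, 1)[1].split(end, 1)[0]


# Hypothetical page fragment, not the real site's HTML.
sample_html = (
    '<h2>Booking details</h2>'
    '<span style="font-size: 18px;">ABC123</span>'
    '<div class="status">Confirmed</div>'
)

details = sample_html.split("Booking details")[1]
confirmation = between(details, '18px;">', "<")
status = between(sample_html, 'class="status">', "<")
print(confirmation, status)
```

If a start marker is absent, `split(...)[1]` raises `IndexError`, which is a useful signal that the page layout has changed.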

Store the values in the dictionary

df_dict["confirmation"].append(confirmation)
df_dict["status"].append(status)

Save the data to CSV

df_ = pd.DataFrame(df_dict)
df_.to_csv(PATH, index=False)
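
To show how the dictionary of lists maps onto CSV rows, here is a small sketch with made-up values. It writes to an in-memory buffer instead of `PATH` so it can be run anywhere; the sample confirmation codes and statuses are invented.

```python
import io

import pandas as pd

# Each key becomes a column; each list holds one value per scraped booking.
df_dict = {
    "confirmation": [],
    "status": [],
}

# Invented rows standing in for scraped values.
for confirmation, status in [("ABC123", "Confirmed"), ("XYZ789", "Cancelled")]:
    df_dict["confirmation"].append(confirmation)
    df_dict["status"].append(status)

df_ = pd.DataFrame(df_dict)

# Write to an in-memory buffer here; pass a file path to write to disk.
buffer = io.StringIO()
df_.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

All lists in the dictionary need the same length, so every scraped entry should append to both columns, even when a value is missing.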

Getting around anti-scraping measures:

  1. Set a USER_AGENT:
    To find your local USER_AGENT, enter chrome://version in Chrome's address bar
    chrome_options = Options()
    chrome_options.add_argument("--user-agent=" + USER_AGENT)

  2. Disguise the USER_AGENT:
    A User-Agent extension can be installed in Chrome
    Once installed, it lets the browser pose as different USER_AGENTs

  3. Change the IP address
    Example:
    proxy_host = "127.0.0.1"
    proxy_port = "8888"
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--proxy-server=http://{}:{}".format(proxy_host, proxy_port))

  4. Reset cookies

Create a first driver, driver0, load the page, and save its cookies

driver0 = webdriver.Chrome(options=chrome_options)
driver0.get(MAIN_PAGE_URL)
save_Cookied = driver0.get_cookies()

Create a second driver, load the page, delete its cookies first (the key step for getting past anti-scraping checks), then add the saved cookies

driver = webdriver.Chrome(options=chrome_options)
driver.get(MAIN_PAGE_URL)
driver.delete_all_cookies()
for cookie in save_Cookied:
    driver.add_cookie(cookie)
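
The two cookie steps above can be folded into one helper. This is a sketch under assumptions: `transfer_cookies` is a hypothetical name, and `FakeDriver` is a stand-in that mimics only the three cookie methods used here so the example runs without a browser.

```python
def transfer_cookies(source_driver, target_driver):
    """Copy all cookies from source_driver into a cleared target_driver.

    Both drivers are expected to have loaded the same page already,
    as in the notes above. Returns the number of cookies copied.
    """
    saved = source_driver.get_cookies()   # list of dicts
    target_driver.delete_all_cookies()    # drop the fresh session's cookies
    for cookie in saved:
        target_driver.add_cookie(cookie)
    return len(saved)


class FakeDriver:
    """Minimal stand-in for a selenium driver (illustration only)."""

    def __init__(self, cookies=None):
        self._cookies = list(cookies or [])

    def get_cookies(self):
        return list(self._cookies)

    def delete_all_cookies(self):
        self._cookies.clear()

    def add_cookie(self, cookie):
        self._cookies.append(cookie)


driver0 = FakeDriver([{"name": "session", "value": "abc"}])
driver = FakeDriver([{"name": "tracking", "value": "zzz"}])
copied = transfer_cookies(driver0, driver)
print(copied, driver.get_cookies())
```

With real drivers the call would simply be `transfer_cookies(driver0, driver)` after both have loaded `MAIN_PAGE_URL`.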
