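Playwright's core concepts include:
A Browser is an instance of Chromium, Firefox, or WebKit (the three browsers Playwright supports). A Playwright script usually starts by launching a browser instance and ends by closing it. A browser instance can be launched in headless mode (no GUI) or headed mode. Creating a Browser instance: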
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    browser.close()
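Launching a browser instance is relatively expensive, so Playwright is designed to get the most out of a single browser instance by running multiple BrowserContexts inside it.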
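A BrowserContext is like a separate incognito session: very lightweight, yet fully isolated.
(Translator's note: each browser instance can hold multiple BrowserContexts, all fully isolated from one another. For example, you can log in to two different accounts in two BrowserContexts, or use a different proxy in each context.)
Creating a context: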
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context()
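To make the isolation between contexts concrete, here is a minimal sketch that opens several contexts in one browser and sets a cookie in only one of them. The helper name `open_isolated_contexts`, the `demo` function, and the cookie values are illustrative assumptions, not part of Playwright; the calls `new_context`, `add_cookies`, and `cookies` are the BrowserContext methods as I understand them.

```python
def open_isolated_contexts(browser, count=2):
    """Create `count` BrowserContexts that all share one browser instance."""
    return [browser.new_context() for _ in range(count)]


def demo():
    # Imported here so the helper above can be read and reused on its own.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        user_a, user_b = open_isolated_contexts(browser)
        # Hypothetical cookie: it is set in user_a only.
        user_a.add_cookies(
            [{"name": "session", "value": "a", "url": "https://example.com"}]
        )
        # Each context keeps its own cookie jar.
        print(len(user_a.cookies()), len(user_b.cookies()))
        browser.close()
```

Calling `demo()` requires Playwright and its browsers to be installed. The point of the sketch is that the expensive `launch()` happens once, while each simulated user gets a cheap context of its own.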
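A context can also be used to emulate multi-page scenarios involving mobile devices, permissions, locale, and color scheme. For example, creating a mobile context: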
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    iphone_11 = p.devices['iPhone 11 Pro']
    browser = p.webkit.launch(headless=False)
    context = browser.new_context(
        **iphone_11,
        locale='de-DE',
        geolocation={'longitude': 12.492507, 'latitude': 41.889938},
        permissions=['geolocation']
    )
    browser.close()
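The `**iphone_11` unpacking above works because a device descriptor is a plain dictionary of context options. As a rough sketch, the options can be merged into one dictionary first, which also avoids the `TypeError` Python raises when the same keyword arrives both through `**` and explicitly. The helper `context_options` and the `sample_device` dictionary are hypothetical stand-ins; real entries of `p.devices` carry more keys.

```python
def context_options(device, **overrides):
    """Merge a device descriptor with per-context overrides (overrides win)."""
    return {**device, **overrides}


# Illustrative stand-in for an entry of p.devices, not a real descriptor.
sample_device = {"viewport": {"width": 375, "height": 812}, "is_mobile": True}

options = context_options(sample_device, locale="de-DE", color_scheme="dark")
print(sorted(options))  # prints ['color_scheme', 'is_mobile', 'locale', 'viewport']
```

With a real browser, the merged dictionary would be passed as `browser.new_context(**options)`.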
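A BrowserContext can have multiple pages; each page represents a tab or a popup window. A page is used to navigate to URLs and to interact with the content of the page.
Creating a page: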
page = context.new_page()
# Navigate explicitly, similar to entering a URL in the browser.
page.goto('http://example.com')
# Fill an input.
page.fill('#search', 'query')
# Navigate implicitly by clicking a link.
page.click('#submit')
# Expect a new url.
print(page.url)
# Page can navigate from the script - this will be picked up by Playwright.
# window.location.href = 'https://example.com'
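A page can have multiple frame objects but only one main frame, and all page-level operations (such as click) act on the main frame. Additional frames are attached to the page with the iframe HTML tag, and they can be accessed in order to interact with the content inside them.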
import re
# Get a frame by its name attribute
frame = page.frame('frame-login')
# Get a frame by URL (a compiled regex; a plain string is treated as a glob pattern)
frame = page.frame(url=re.compile(r'.*domain.*'))
# Get a frame through another selector
frame_element_handle = page.query_selector('.frame-class')
frame = frame_element_handle.content_frame()
# Interact with the frame
frame.fill('#username-input', 'John')
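In recording mode, Playwright automatically detects whether an action takes place inside a frame, so when a frame is hard to locate you can use recording mode to find it.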
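When a frame is hard to locate, it can also help to list what is attached to the page. The sketch below assumes a `page` object exposing `frames`, and frames exposing `name`, `url`, and `parent_frame`, which is how I understand the Playwright Python Page and Frame objects; the helper name `describe_frames` is made up for illustration.

```python
def describe_frames(page):
    """Return (name, url, is_main) for every frame attached to `page`.

    The main frame is the only one without a parent frame.
    """
    return [
        (frame.name, frame.url, frame.parent_frame is None)
        for frame in page.frames
    ]
```

Printing `describe_frames(page)` after navigation shows which name or URL to pass to `page.frame(...)`.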
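Playwright can locate elements by CSS selector, XPath selector, HTML attribute (such as id or data-test-id), or text content.
Except for XPath selectors, all selectors pierce shadow DOM by default. To match only the regular (light) DOM, use the *:light variant of a selector engine, such as css:light. This is usually not necessary.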
# Using data-test-id= selector engine
page.click('data-test-id=foo')
# CSS and XPath selector engines are automatically detected
page.click('div')
page.click('//html/body/div')
# Find node by text substring
page.click('text=Hello w')
# Explicit CSS and XPath notation
page.click('css=div')
page.click('xpath=//html/body/div')
# Only search light DOM, outside WebComponent shadow DOM:
page.click('css:light=div')
# Different selectors can be combined by chaining them with >>
# Click an element with text 'Sign Up' inside of a #free-month-promo.
page.click('#free-month-promo >> text=Sign Up')
# Capture textContent of a section that contains an element with text 'Selectors'.
section_text = page.eval_on_selector('*css=section >> text=Selectors', 'e => e.textContent')
For details, see "Element selectors" in the Playwright Python documentation.
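Chained selectors are ordinary strings, so they can be assembled from parts. A minimal sketch, assuming only the `>>` chaining syntax shown above; the helper `chain` is a hypothetical convenience, not a Playwright function.

```python
def chain(*parts):
    """Join selector parts with the '>>' chaining operator."""
    cleaned = [part.strip() for part in parts if part and part.strip()]
    if not cleaned:
        raise ValueError("at least one selector part is required")
    return " >> ".join(cleaned)


promo_signup = chain("#free-month-promo", "text=Sign Up")
print(promo_signup)  # prints "#free-month-promo >> text=Sign Up"
```

The result can be passed wherever a selector is expected, for example `page.click(promo_signup)`.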
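Before acting on an element, Playwright performs a series of actionability checks to make sure the action will behave as expected. It auto-waits for all relevant checks to pass and only then performs the requested action. If the required checks do not pass within the given timeout, the action fails with a TimeoutError.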
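Actions such as page.click(selector, **kwargs) and page.fill(selector, value, **kwargs) auto-wait for the element to become visible and actionable. For example, click will wait for the element to appear in the DOM, wait for it to become visible (a non-empty bounding box and no visibility:hidden), wait for it to stop moving, scroll it into view, and wait for it to receive pointer events at the click point.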
# Playwright waits for #search element to be in the DOM
page.fill('#search', 'query')
# Playwright waits for element to stop animating
# and accept clicks.
page.click('#search')
# You can also wait explicitly
# Wait for #search to appear in the DOM.
page.wait_for_selector('#search', state='attached')
# Wait for #promo to become visible, for example with `visibility:visible`.
page.wait_for_selector('#promo')
# Wait for #details to become hidden, for example with `display:none`.
page.wait_for_selector('#details', state='hidden')
# Wait for #promo to be removed from the DOM.
page.wait_for_selector('#promo', state='detached')
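When the checks do not pass in time, the action raises a TimeoutError, which a script can catch for elements that are optional. The following is a sketch under assumptions: `click_if_present` is a hypothetical helper, the 2000 ms default is arbitrary, and the `ImportError` fallback exists only so the snippet loads where Playwright is not installed.

```python
try:
    from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
except ImportError:
    # Fallback so this sketch can be loaded without Playwright installed.
    PlaywrightTimeoutError = TimeoutError


def click_if_present(page, selector, timeout_ms=2000):
    """Click `selector`; return False if the auto-wait checks time out."""
    try:
        # `timeout` is given in milliseconds.
        page.click(selector, timeout=timeout_ms)
        return True
    except PlaywrightTimeoutError:
        return False
```

A call such as `click_if_present(page, '#promo')` lets a script dismiss an optional banner without failing when the banner never appears.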
The page.evaluate(expression, **kwargs) API runs a JavaScript function in the context of the web page and returns the result to the Playwright environment. Browser globals such as window and document can be used inside evaluate.
href = page.evaluate('() => document.location.href')
# if the result is a Promise or if the function is asynchronous evaluate will automatically wait until it's resolved
status = page.evaluate("""async () => {
  const response = await fetch(location.href)
return response.status
}""")
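The page.evaluate(expression, **kwargs) method accepts a single optional argument. This argument can be a mix of Serializable values and JSHandle or ElementHandle instances. Handles are automatically converted to the values they represent.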
result = page.evaluate("([x, y]) => Promise.resolve(x * y)", [7, 8])
print(result) # prints "56"
print(page.evaluate("1 + 2")) # prints "3"
x = 10
print(page.evaluate(f"1 + {x}")) # prints "11"
body_handle = page.query_selector("body")
html = page.evaluate("([body, suffix]) => body.innerHTML + suffix", [body_handle, "hello"])
body_handle.dispose()
# A primitive value.
page.evaluate('num => num', 42)
# An array.
page.evaluate('array => array.length', [1, 2, 3])
# An object.
page.evaluate('object => object.foo', { 'foo': 'bar' })
# A single handle.
button = page.query_selector('button')
page.evaluate('button => button.textContent', button)
# Alternative notation using elementHandle.evaluate.
button.evaluate('(button, from) => button.textContent.substring(from)', 5)
# Object with multiple handles.
button1 = page.query_selector('.button1')
button2 = page.query_selector('.button2')
page.evaluate("""o => o.button1.textContent + o.button2.textContent""",
{ 'button1': button1, 'button2': button2 })
# Object destructuring works. Note that property names must match
# between the destructured object and the argument.
# Also note the required parenthesis.
page.evaluate("""
({ button1, button2 }) => button1.textContent + button2.textContent""",
{ 'button1': button1, 'button2': button2 })
# Array works as well. Arbitrary names can be used for destructuring.
# Note the required parenthesis.
page.evaluate("""
([b1, b2]) => b1.textContent + b2.textContent""",
[button1, button2])
# Any non-cyclic mix of serializables and handles works.
page.evaluate("""
x => x.button1.textContent + x.list[0].textContent + String(x.foo)""",
{ 'button1': button1, 'list': [button2], 'foo': None })