Hybrid Web Automation
When the captcha pops, the algo stops
In our prior article, Web Automation, we covered the main options for interacting programmatically with web-based content, depending on the level of complexity the task requires: raw transactions, structured content, sessions, browsers, and traffic control via proxies. If you are new to this field, I recommend you start there, as it provides a step-by-step introduction and plenty of examples in Python.
In fact, all you need is there… if it weren't for the fact that many content owners hate web automation (often because they want to implement discriminatory pricing tactics and force you to subscribe to their expensive premium API for these functions) and try to prevent it by any means possible. The most popular of these means are captcha screens, which present you with a challenge that requires human cognitive capabilities to solve before you can proceed with your browsing.
A Better Mousetrap
Of course you can deploy countermeasures in your code, both traditional and AI-based, to avoid triggering such captchas: throttle the velocity of your actions, space out your requests unevenly, add sporadic mouse movements and clicks… But no matter how carefully the rodent tiptoes through the site, it is bound to eventually fall into one trap or another, because in this cat-and-mouse game the feline always holds the upper hand (or paw): the precise triggers are unknown, they may change over time, and some sites even present trigger-less captchas at unexpected times “just in case”, similar to the random TSA screenings that happen at airports.
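As an illustration of the "space out your requests unevenly" countermeasure, here is a minimal sketch. The helper names (`jittered_delay`, `polite_visit`) and the default timings are assumptions made for this example and are not part of the published code; `visit` stands for whatever function actually loads a page.

```python
import random
import time


def jittered_delay(base=2.0, spread=1.5, rng=random):
    """Return a delay in seconds drawn unevenly around `base`."""
    # Hypothetical helper: the delay never drops below half of `base`.
    return max(base * 0.5, rng.uniform(base - spread, base + spread))


def polite_visit(urls, visit, base=2.0, spread=1.5, sleep=time.sleep):
    """Call `visit(url)` for each url, pausing an irregular amount after each one."""
    results = []
    for url in urls:
        results.append(visit(url))
        # An irregular pause looks less mechanical than a fixed interval.
        sleep(jittered_delay(base, spread))
    return results
```

Passing `sleep` as a parameter keeps the sketch easy to test without actually waiting. As the article notes, this kind of throttling only lowers the odds of a captcha; it does not remove them.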
And while the first generation of captchas (e.g. “type the digits that you see in this image”) could be defeated by implementing context-specific AI algorithms, the captchas widely used nowadays have a level of sophistication (e.g. “find specific objects in this series of images, which change over time”) that makes any attempt at automation too time-consuming and impractical.
So it seems that we are condemned to this motto: when the captcha pops, the algo stops. Our process breaks, our tasks don’t complete and frustration ensues. But there’s an alternative: if the captchas require human cognition… then let’s bring a human into the equation! That’s what we call Hybrid Web Automation: a system that executes independently for the most part but, when faced with an unexpected situation (such as a captcha screen), requests the assistance of a human counterpart and waits patiently until all is clear to resume normal operations, instead of crashing on the spot.
Example: Automating Instagram
To make our explanation as practical as possible, we are going to apply hybrid automation in Python to the particular use case of downloading all the pictures of an Instagram profile of our choosing.
It is important to remember, though, that “web automation” goes beyond the mere collection of content, and also includes the possibility of interacting with web pages by filling in forms, providing data and activating services. That is, bi-directional autonomous interaction. But for our purposes web scraping is the simplest case to portray, an MVP of sorts.
Essentially what we need to do is to create a wrapper around our methods using the Proxy Design Pattern such that we don’t call them directly but always via the proxy. Thus instead of:
def do_something(param1):
    driver.get(param1)
We will write our functionality as:
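def do_something(param1):
    # Public entry point: always goes through the proxy
    return proxy(_do_something, param1)

def proxy(fun, param1=None):
    try:
        return fun(param1)
    except Exception:
        pass  # Manual error handling goes here!

def _do_something(param1):
    # Inner function that actually drives the browser
    driver.get(param1)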
Which reads as: when we call do_something it calls the proxy, which calls the inner _do_something function, which in turn executes the required functionality in the browser. Should the task fail at any point, control returns to the proxy, where it stops (and calls the human using, for instance, visible and audible signals) until the situation is handled.
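To make the "stop and call the human" step concrete, here is a minimal, self-contained sketch of such a proxy. It is a simplified illustration, not the wrapper from the repository: the prompt wording, the `ask` parameter (which defaults to the built-in `input`) and the `max_attempts` limit are all assumptions made for this example.

```python
def proxy(fun, param=None, ask=input, max_attempts=5):
    """Run fun(param); on failure, ask a human what to do instead of crashing."""
    for _ in range(max_attempts):
        try:
            return fun(param)
        except Exception as error:
            print("ERROR: %s" % error)
            answer = ask("(r)etry, (s)kip or (a)bort? ").strip().lower()
            if answer == "s":
                return None  # give up on this task and move on
            if answer == "a":
                raise  # re-raise the original error
            # Any other answer: loop around and retry, presumably after
            # the human has solved the captcha in the open browser.
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

Taking the prompt function as a parameter is a design choice for this sketch: it lets the same proxy run interactively in a terminal or with a canned answer in unattended runs and tests.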
Sample Code
The complete working code has been published in the same Git repository used in the prior Web Automation article, using Selenium Chrome as our programmatic web driver:
https://github.com/isaacdlp/scraphacks
First we have to log in to Instagram. Our code supports both entering credentials and retrieving stored cookies to reuse sessions (login frequency is often one of the trigger criteria for captchas, so we cover our backs by minimizing the need for re-authentication):
def _login(self, site):
    # [...]
    elif site == "instagram":
        # Log in with credentials only if no stored cookies could be reused
        if not self._cookies(site):
            self._print("Login to instagram")
            self.browser.get("https://www.instagram.com")
            self.wait()
            login_form = self.browser.find_element_by_css_selector("article form")
            login_email = login_form.find_element_by_name("username")
            login_email.send_keys(creds["username"])
            login_pass = login_form.find_element_by_name("password")
            login_pass.send_keys(creds["password"])
            login_pass.send_keys(Keys.ENTER)
            self.wait(5)
            # Raises if the profile link is missing, i.e. the login did not succeed
            self.browser.find_element_by_css_selector("nav a[href='/%s/']" % creds["username"])
            if self.use_cookies:
                # Store the session cookies so later runs can skip the login
                cookies = self.browser.get_cookies()
                with open("%sscrap.cookie" % site, "w") as f:
                    json.dump(cookies, f, indent=2)
    return True
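The cookie reuse described above can be sketched in isolation as follows. The function names `save_cookies` and `load_cookies` are hypothetical and are not the repository's `_cookies` method; the only Selenium calls assumed are `get_cookies()` and `add_cookie()`, which exist on WebDriver objects.

```python
import json
import os


def save_cookies(driver, path):
    """Dump the browser's current cookies to a JSON file."""
    with open(path, "w") as f:
        json.dump(driver.get_cookies(), f, indent=2)


def load_cookies(driver, path):
    """Restore cookies from `path`; return False if there is nothing to load."""
    if not os.path.exists(path):
        return False
    with open(path) as f:
        cookies = json.load(f)
    for cookie in cookies:
        # Selenium only accepts cookies for the domain currently loaded,
        # so the caller should navigate to the site before calling this.
        driver.add_cookie(cookie)
    return True
```

A caller would typically try `load_cookies` first, refresh the page, and fall back to the credential login only when it returns False or the session turns out to have expired.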
Then, our implementation of the do_something function (called _instagram) is rather simple and can be found below:
def _instagram(self, url):
    props = self._base(url)
    media = []
    try:
        # Open the first media item of the profile
        self.browser.execute_script("document.querySelector('article a').click()")
        while True:
            self.wait()
            try:
                image = self.browser.find_element_by_css_selector("article.M9sTE img[decoding='auto']")
                srcset = image.get_attribute("srcset")
                # Each srcset entry is "<url> <width>w"; keep the widest one
                srcs = [src.split(" ") for src in srcset.split(",")]
                srcs.sort(reverse=True, key=lambda x: int(x[1][:-1]))
                src = srcs[0][0]
                media.append({"type" : "jpg", "src" : src})
            except Exception:
                # Not an image: try to read the item as a video instead
                try:
                    video = self.browser.find_element_by_css_selector("article.M9sTE video")
                    src = video.get_attribute("src")
                    media.append({"type": "mpg", "src": src})
                except Exception:
                    pass
            try:
                # Move on to the next item; fails when there are no more
                self.browser.execute_script("document.querySelector('a.coreSpriteRightPaginationArrow').click()")
            except Exception:
                break
    except Exception:
        pass
    props["Media"] = media
    return props
It basically follows this routine:
- Navigates to the target profile.
- Traverses all media items sequentially, from last to first (because Instagram, like many other social sites, is built with a Progressive Feed pattern in mind: older content loads once you scroll down the page).
- Grabs the unique url of each media item (thus decoupling the gathering of items from the actual download, again a strategy for captcha prevention).
- Returns the list of media items as a “Media” property.
Please note that despite its simplicity, the example has been extended to handle both images and videos out of the box.
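The image branch of the routine above picks the largest rendition listed in the `srcset` attribute. That step can be sketched on its own as a pure function; `largest_from_srcset` is a hypothetical name, and the sketch assumes every entry has the form `<url> <width>w` with no commas inside the urls.

```python
def largest_from_srcset(srcset):
    """Return the url with the largest width descriptor in a srcset string."""
    candidates = []
    for entry in srcset.split(","):
        parts = entry.split()  # tolerates extra spaces around each entry
        if len(parts) == 2 and parts[1].endswith("w"):
            width = int(parts[1][:-1])  # "1080w" -> 1080
            candidates.append((width, parts[0]))
    if not candidates:
        raise ValueError("no width descriptors found in srcset")
    return max(candidates)[1]
```

For instance, `largest_from_srcset("https://x/a.jpg 640w, https://x/b.jpg 1080w")` returns `"https://x/b.jpg"`.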
Reusable Hybrid Wrapper
What is most interesting in the example above is that both the login and the scraping methods use the same proxy function! Namely, this one:
def _proxy(self, fun, var = None):
    if self.browser:
        successful = False
        while not successful:
            try:
                return fun(var)
            except Exception as e:
                if self.interactive:
                    exc_type, exc_obj, exc_tb = sys.exc_info()
                    print("ERROR '%s' at line %s" % (e, exc_tb.tb_lineno))
                    cmd = self.default_cmd
                    if not cmd:
                        props = {"loop": 0}
                        if self.audible:
                            thread = threading.Thread(target=self._play, args=(props,))
                            thread.start()
                        cmd = input("*(r)epeat, (c)ontinue, (a)bort or provide new url? ")
                        props["loop"] = self.max_loop
                    if cmd == "r" or cmd == "":
                        pass
                    elif cmd == "c":
                        successful = True
                    elif cmd == "a":
                        raise e
                    else:
                        var = cmd
                else:
                    raise e
Reusability is the whole point: the code above handles errors gracefully no matter what the original task was, loops a sound a configurable number of times (to call the attention of the human without being annoying in case they are busy with other matters) and presents a command-line prompt with the options of retrying the last url, switching to a new url, moving on to the next task, or aborting if the error cannot be addressed.
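The audible alert can be sketched independently of the wrapper. Note that this sketch swaps in a `threading.Event` to silence the sound, whereas the code above shares a `props["loop"]` counter with the `_play` thread; the names `terminal_bell`, `play_alert` and `ask_with_alert` are hypothetical, and the terminal bell merely stands in for whatever sound the real `_play` method produces.

```python
import threading


def terminal_bell():
    # Stand-in sound: the ASCII bell character, which many terminals beep on.
    print("\a", end="", flush=True)


def play_alert(stop, max_loop=5, beep=terminal_bell, pause=1.0):
    """Beep up to `max_loop` times, or until `stop` is set."""
    for _ in range(max_loop):
        if stop.is_set():
            break
        beep()
        stop.wait(pause)  # sleeps, but wakes early once `stop` is set


def ask_with_alert(prompt, ask=input, max_loop=5, beep=terminal_bell, pause=1.0):
    """Prompt the human while an alert sounds in a background thread."""
    stop = threading.Event()
    thread = threading.Thread(
        target=play_alert, args=(stop, max_loop, beep, pause), daemon=True
    )
    thread.start()
    try:
        return ask(prompt)
    finally:
        stop.set()  # silence the alert as soon as the human has answered
        thread.join()
```

Capping the number of beeps with `max_loop` serves the same purpose as in the original: the human gets notified, but an unattended session does not beep forever.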
Even in this last situation (forced exit), implementing hybrid automation helps enormously, as we are still presented with an interactive session in the current browser in order to understand and debug the issue, instead of a crashed session, a closed browser and an ugly exception printout.
Furthermore, we can encapsulate our proxy into a Python object to keep the code and its actions portable across different web automation projects. That is precisely what we have done in the scrapper folder of our demo:
https://github.com/isaacdlp/scraphacks/tree/master/scrapper
Now the requirements to complete our particular Instagram example are simplified a great deal: first call the object we just created, and then download the specific media items as we see fit. The code we showcase below can be found in the socialscrap.py file:
https://github.com/isaacdlp/scraphacks/blob/master/socialscrap.py
from scrapper import *
import os
import requests as req

target = "isaacdlp"
folder = "download/%s" % target
# makedirs also creates the parent "download" folder if it is missing
os.makedirs(folder, exist_ok=True)

scrapper = Scrapper()
try:
    scrapper.start()
    scrapper.login("instagram")
    props = scrapper.instagram("https://www.instagram.com/%s" % target)
finally:
    scrapper.stop()

# Download each collected media url once the browser session is closed
for i, prop in enumerate(props["Media"], start = 1):
    res = req.get(prop["src"])
    if res.status_code != 200:
        break
    with open("%s/%s-%s.%s" % (folder, target, i, prop["type"]), 'wb') as bout:
        bout.write(res.content)
Beyond that, please check the __init__.py file for more details on the wrapper implementation. It includes other advanced functions such as generalized cookie handling, page scrolling, and screenshot captures of websites. Feel free to make the code your own, adapt it to your specific needs and extend it to support other use cases.
You are most welcome!
Translated from: https://medium.com/algonaut/hybrid-web-automation-c3b3e8700d0a