It often happens that you come across a website and are forced to perform a set of actions to finally get some data. You are then faced with a dilemma: how do you make this data available in a form which can easily be consumed by your application?
Scraping comes to the rescue in such a case. And selecting the right tool for the job is quite important.
Puppeteer is a Node.js library maintained by the Chrome DevTools team at Google. It runs a Chromium or Chrome (perhaps the more recognizable name) instance, headless by default but configurable, and exposes a set of high-level APIs.
According to its official documentation, Puppeteer is commonly used for tasks such as generating screenshots and PDFs of pages, crawling SPAs and producing pre-rendered content (server-side rendering), automating form submission and UI testing, and capturing performance traces.
For our case, we need to be able to access a website and map the data in a form which can be easily consumed by our application.
Sounds simple? The implementation is not that complex, either. Let's start.
My fondness for Amazon products prompts me to use one of their product listing pages as a sample here. We will implement our use case in two steps: extracting the data from the product listing page, and then automating the navigation to that listing page before extracting the data.
You can find the complete code in this repository.
We will be extracting the data from this link: https://www.amazon.in/s?k=Shirts&ref=nb_sb_noss_2 (a listing of the top searched shirts) in an API-servable form.
Before we get started using puppeteer extensively in this section, we need to understand the two primary classes provided by it.
Browser: launches a Chrome instance when we use puppeteer.launch or puppeteer.connect. This works as a simple browser emulation.
Page: resembles a single tab in a Chrome browser. It provides an exhaustive set of methods you can use with a particular page instance, and a new instance is created when we call browser.newPage. Just like you can create multiple tabs in the browser, you can create multiple page instances at the same time in puppeteer.
We start setting up puppeteer by using the npm module provided. After installing puppeteer, we create an instance of the browser and the page class and navigate to the target URL.
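After installing puppeteer (npm install puppeteer), a minimal sketch of this setup could look like the following. The headless: false and defaultViewport: null flags are optional and are discussed in the gotchas section below.

```js
const puppeteer = require('puppeteer');

const url = 'https://www.amazon.in/s?k=Shirts&ref=nb_sb_noss_2';

(async () => {
  // Launch a browser instance. headless: false lets you watch the run;
  // defaultViewport: null is optional and keeps the page at the window size.
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  // A Page behaves like a single browser tab.
  const page = await browser.newPage();

  // Consider the navigation done once there have been no more than
  // 2 network connections for at least 500 ms.
  await page.goto(url, { waitUntil: 'networkidle2' });

  // ...extraction goes here (next section)...

  // Always close the browser instance at the end of the session.
  await browser.close();
})();
```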
We use networkidle2 as the value for the waitUntil option while navigating to the URL. This ensures that the page load is considered final once there have been no more than 2 network connections for at least 500 ms.
Note: You do not need to have Chrome or an instance of it installed on your system for puppeteer to work. The library ships with a bundled version of Chromium.
The DOM has already loaded in the page instance created. We will go ahead and leverage the page.evaluate() method to query the DOM.
Before we start, we need to figure out the exact data-points we need to extract. In the current sample, each of the product objects will look something like this.
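Based on the fields we map out below, each product object might take roughly this shape (the values here are made up purely for illustration):

```js
const product = {
  brand: 'Some Brand',                       // brand name shown on the card
  product: 'Men Regular Fit Casual Shirt',   // product title
  url: 'https://www.amazon.in/dp/...',       // link to the product page
  image: 'https://m.media-amazon.com/...',   // product thumbnail
  price: '₹599',                             // displayed price
};
```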
We have laid out the structure we want to achieve. Time to start inspecting the DOM for the identifiers. We check for the selectors that occur throughout the items to be mapped. We will mostly use document.querySelector and document.querySelectorAll for traversing the DOM.
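A rough sketch of that traversal with page.evaluate might look like this, using the selectors identified below. The fallback between the two product-name selectors and the null checks are assumptions rather than part of the original snippet.

```js
// Wait for the listed items to be rendered before querying the DOM.
await page.waitForSelector('div[data-cel-widget^="search_result_"]');

const products = await page.evaluate(() => {
  // This callback runs in the browser context, so document APIs are available.
  const totalItems = document.querySelectorAll(
    'div[data-cel-widget^="search_result_"]'
  ).length;
  const data = [];

  for (let i = 0; i < totalItems; i++) {
    const root = `div[data-cel-widget="search_result_${i}"]`;

    // traverse for brand and product names
    const brandEl = document.querySelector(`${root} .a-size-base-plus.a-color-base`);
    const productEl =
      document.querySelector(`${root} .a-size-medium.a-color-base.a-text-normal`) || brandEl;

    const urlEl = document.querySelector(`${root} a[target="_blank"].a-link-normal`);
    const imageEl = document.querySelector(`${root} .s-image`);
    const priceEl = document.querySelector(`${root} span.a-offscreen`);

    // Some nodes (ads, separators) are not product cards; skip them.
    if (!productEl || !urlEl) continue;

    data.push({
      brand: brandEl ? brandEl.innerText : null,
      product: productEl.innerText,
      url: urlEl.href,
      image: imageEl ? imageEl.src : null,
      price: priceEl ? priceEl.innerText : null,
    });
  }

  return data;
});

console.log(products); // logged in the Node context
```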
After investigating the DOM, we see that each listed item is enclosed under an element with the selector div[data-cel-widget^="search_result_"]. This particular selector seeks out all div tags with the attribute data-cel-widget whose value starts with search_result_.
Similarly, we map out the selectors for the parameters we require, as listed below. If you want to learn more about DOM traversal, you can check out this informative article by Zell.
- total listed items: div[data-cel-widget^="search_result_"]
- brand: div[data-cel-widget="search_result_${i}"] .a-size-base-plus.a-color-base (i stands for the node number in total listed items)
- product: div[data-cel-widget="search_result_${i}"] .a-size-base-plus.a-color-base or div[data-cel-widget="search_result_${i}"] .a-size-medium.a-color-base.a-text-normal (i stands for the node number in total listed items)
- url: div[data-cel-widget="search_result_${i}"] a[target="_blank"].a-link-normal (i stands for the node number in total listed items)
- image: div[data-cel-widget="search_result_${i}"] .s-image (i stands for the node number in total listed items)
- price: div[data-cel-widget="search_result_${i}"] span.a-offscreen (i stands for the node number in total listed items)
Note: We wait for elements matching the div[data-cel-widget^="search_result_"] selector to be available on the page by using the page.waitFor method.
Once the page.evaluate method is invoked, we can see the data we require logged.
So far we are able to navigate to a page, extract the data we need, and transform it into an API-ready form. That sounds all hunky-dory.
However, consider for a moment a case where you have to navigate to one URL from another by performing some actions – and then try to extract the data you need.
Would that make your life a little trickier? Not at all. Puppeteer can easily imitate user behavior. Time to add some automation to our existing use case.
Unlike in the previous example, we will go to the amazon.in homepage and search for 'Shirts'. That will take us to the products listing page, and we can extract the data required from the DOM. Easy peasy. Let's look at the code.
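A sketch of that flow could look like the following. The search-box id (#twotabsearchtextbox) and the submit-button selector are assumptions about Amazon's markup and may need adjusting if the page changes.

```js
const searchTerm = 'Shirts';

// Start from the homepage instead of the listing URL.
await page.goto('https://www.amazon.in/', { waitUntil: 'networkidle2' });

// Wait for the search box to be available.
await page.waitForSelector('#twotabsearchtextbox');

// Add the searchTerm passed, from within the page context.
await page.evaluate((term) => {
  document.querySelector('#twotabsearchtextbox').value = term;
}, searchTerm);

// Emulate the search button click and wait for the listing page to load,
// which exposes the DOM we scrape in the previous section.
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle2' }),
  page.click('.nav-search-submit input[type="submit"]'), // assumed selector
]);
```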
We can see that we wait for the search box to be available and then add the searchTerm passed, using page.evaluate. We then navigate to the products listing page by emulating the 'search button' click action, which exposes the DOM.
The complexity of automation varies from use case to use case.
Puppeteer's API is pretty comprehensive, but there are a few gotchas I came across while working with it. Remember, not all of these gotchas relate directly to puppeteer itself; some concern things that simply work better alongside it.
Puppeteer creates a Chrome browser instance, as already mentioned. However, some websites might block access if they suspect bot activity. There is a package called user-agents which can be used with puppeteer to randomize the user agent for the browser.
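A small sketch of how that could look, using the package's constructor and its toString output:

```js
const UserAgent = require('user-agents');

// Generate a random, realistic desktop user agent for this session.
const userAgent = new UserAgent({ deviceCategory: 'desktop' });

// Apply it to the page before navigating anywhere.
await page.setUserAgent(userAgent.toString());
```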
Note: Scraping a website lies somewhere in the grey areas of legal acceptance. I would recommend using it with caution and checking rules where you live.
We came across defaultViewport: null when launching our Chrome instance, and I had listed it as optional. This is because it comes in handy only when you are viewing the Chrome instance being launched. It prevents the website's width and height from being affected when it is rendered.
Remember to always end a puppeteer session by closing the Browser instance with browser.close. (I happened to miss this on my first try.) It ends the running browser session.
Certain common JavaScript operations like console.log() will not work within the scope of the page methods. The reason is that the page (browser) context differs from the Node context in which your application is running.
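For example, a console.log inside page.evaluate prints to the browser's console, not to your terminal. One way to surface those messages in Node is to listen for the page's console events, as in this small sketch:

```js
// Forward messages logged inside the page to the Node.js console.
page.on('console', (msg) => console.log('PAGE LOG:', msg.text()));

await page.evaluate(() => {
  // This runs in the browser context and would otherwise not show up in the terminal.
  console.log('Logged from inside the page');
});
```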
These are some of the gotchas I noticed. If you have more, feel free to reach out to me with them. I would love to learn more.
Done? Let's run the application.
The application is run in non-headless mode so you can see exactly what happens. We will automate the navigation to the product listing page from which we obtain the data.
There. You have your own API-consumable data setup from the website of your choice. All you need to do now is wire this up with a server-side framework like express and you are good to go.
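As a rough illustration of that wiring, the scraping logic could be wrapped in a function and served from an express route. scrapeProducts and the ./scraper module are hypothetical names here, standing in for the puppeteer code above.

```js
const express = require('express');
// Hypothetical wrapper around the puppeteer logic from the previous sections.
const { scrapeProducts } = require('./scraper');

const app = express();

app.get('/api/products', async (req, res) => {
  try {
    const products = await scrapeProducts(req.query.search || 'Shirts');
    res.json(products);
  } catch (err) {
    res.status(500).json({ error: 'Failed to scrape products' });
  }
});

app.listen(3000, () => console.log('API listening on port 3000'));
```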
There is so much you can do with Puppeteer. This is just one particular use case. I would recommend that you spend some time to read the official documentation. I will be doing the same.
Puppeteer is used extensively in some of the largest organizations for automation tasks like testing and server side rendering, among others.
There is no better time to get started with Puppeteer than now.
If you have any questions or comments, you can reach out to me on LinkedIn or Twitter.
In the meantime, keep coding.
Translated from: https://www.freecodecamp.org/news/create-api-website-using-puppeteer/