Improve Your Web Scraper with Limited Retry-Loops in Python

Mastering exception-handling is of pivotal importance for producing clean and stable Python code. Chances are high you’re already aware of that, as most Python books geared towards newcomers to the language — and often to coding in general — make sure to spend a few paragraphs on the subject.

In essence, exception-handling means you’re providing an alternative action when a specific piece of code (usually a single line) throws an error for whatever reason. The attentive coder incorporates exceptions in their script for the same reason they use regex patterns for string matching: your code should harbor as few assumptions as possible. An operation on one type of dataset might not work for a slightly different kind of dataset, and what works today might not work tomorrow. Exception statements are therefore the last safeguard against breaking your code.

The try-except-else-finally clause is a classic and stamped into every aspirant-Python aficionado from day one:

try:
    # Whatever you wish to execute
    ...
except:
    # If this throws an error, do the following. Usually the user implements
    # one of these three steps:
    #   (a) throw an exception (which breaks the code),
    #   (b) perform an alternative action (e.g. record that the try-statement
    #       produced an error, but continue with the code anyway), or
    #   (c) pass (simply execute the rest of the script)
    ...
else:
    # If this doesn't throw an error, do the following
    ...
finally:
    # The finally-statement is optional. Whether the try-statement
    # produced an error or not, do the following.
    ...

Exception-handling is especially important for web scraping. After all, there are plenty of reasons why your scraper might break unexpectedly, such as:

  • A particular page does not exist. Many scrapers have built-in assumptions revolving around URL structure. For example, for a scraper I wrote to collect info on Android apps, the code assumes that filling in the app name in the following URL structure will result in a 200 response code (i.e. the connection succeeded): ‘https://play.google.com/store/apps/details?id=appname’. Although this is true in most cases (go ahead and fill in ‘com.nianticlabs.pokemongo’), some apps might get deleted from the app store or are downloaded from different repositories in the first place. If that’s the case, I instructed the script to try out an alternative platform (e.g. APKMonk); a short sketch of this fallback follows the list.

  • A particular web element does not (always) exist. For example, while scraping some news websites, it’s possible that only some articles will include a subheading, hidden in a particular tag-attribute combination. One could simply check whether a subheading is present (try), skip the scraping of the subheading or add an error message if it’s not available (except), and scrape the subheading if possible (else).

  • Lost internet connection or website outages. Although this can be due to a lousy internet connection, it’s also a factor to consider when you’re performing IP rotation in one way or another, by using proxies or rotating between different VPN servers for example. In these cases, it sometimes takes a little while for the connection to be established. However, sometimes the problem lies not with the client but with the server itself. For instance, some websites are notorious for their frequent outages due to heavy server loads (e.g. Reddit).

  • IP-block. Many websites want to ward off scrapers since they can increase server load (if used aggressively). Other platforms consider scraping as theft and are therefore outright against automated data collection no matter how slow the scraping process. For this reason, many platforms have some kind of bot-detection in place, giving them the ability to block your IP — at least for a little while.

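As a quick aside on the first point above, here’s a minimal sketch of such a fallback, assuming the requests package; the APKMonk URL pattern is an illustrative guess, not a verified endpoint:

import requests

app_id = 'com.nianticlabs.pokemongo'
play_url = 'https://play.google.com/store/apps/details?id=' + app_id

response = requests.get(play_url)
if response.status_code != 200:
    # The app is gone from the Play Store: try an alternative platform.
    # This APKMonk URL pattern is an assumption for illustration.
    response = requests.get('https://www.apkmonk.com/app/' + app_id + '/')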

The first two instances call for a classic try-except clause. For instance, while scraping the Metacritic platform using the well-known BeautifulSoup package, I noticed that most outlets on their review page are represented by logos (usually the more prestigious ones like The New York Times), while others are represented by a simple text attribute. The following piece of code tries to catch the alternate title of the image (e.g. “The New York Times”) for each available review of a specific product (try). If this results in an error, the script reasons that the outlet must be represented as simple text and subsequently tries to catch the first ‘a’ tag within the review (except). In the end, the outlet is added to our outlet-list, no matter whether the outlet is represented by an image or text (finally).

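The snippet embedded in the original article didn’t survive this copy, so here’s a minimal reconstruction of the structure just described. The URL, the ‘review’ container and the tag/class names are illustrative assumptions, not Metacritic’s actual markup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.metacritic.com/movie/son-of-saul/critic-reviews'  # example page
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')

outlets = []
for review in soup.find_all('div', class_='review'):
    outlet = None
    try:
        # Most outlets are logos: catch the image's alternate title
        outlet = review.find('img')['title']
    except (TypeError, KeyError):
        # No (titled) image: the outlet must be plain text,
        # so catch the first 'a' tag within the review
        link = review.find('a')
        outlet = link.get_text(strip=True) if link else None
    finally:
        # Either way, add the outlet to our outlet-list
        outlets.append(outlet)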

The latter two causes are a different story. They do not throw an error because you wrongfully assumed a particular page or element exists, but simply because some kind of external factor is influencing your ability to connect to the web. Most importantly, they represent (possibly) temporary outages: it might be a matter of seconds before your internet connection or the website itself is back up and running, and a platform usually lifts the IP-block after you patiently wait for fifteen minutes or so. This asks for a different approach, since throwing an except right away might be too drastic; it’s worth retrying a couple of times before discarding the whole process. In other words: you need a finite or limited retry-loop.

In this kind of loop, the except will trigger a short pause before executing the try-part again. It will do this until it has exhausted its number of retries. After that, the coder can still choose to either pass on to the following chunk of code, throw an error, or append some kind of error message and continue executing the remainder of the script. This last part is crucial, as it differs from a (possibly) infinite while-loop as in the following piece of code:

output = None
while output is None:
    try:
        # Do the following (and assign the result to output)
        output = ...
    except:
        pass

This will let your script run in an infinite loop if it’s unable to execute the try-part, which — in a sense — is just as script-breaking as not incorporating the try-part at all.

The most elegant and efficient way of constructing a limited retry-loop is by combining a for-loop, a try-except and the continue statement all in one go. The template looks something like this:

import time

for attempt in range(n):
    try:
        # Do the following
        ...
    except:
        time.sleep(n2)
        continue
    else:
        break
else:
    # Do the following if, after n tries, the try-part still throws an error
    ...

The continue-statement, combined with the else-clause outside the loop, is key here. The continue-statement sends the code back to the beginning of the loop (i.e. the try-part). Before it does that, though, it pauses the script for n2 seconds, giving your internet connection and/or the website a chance to go back online. Let’s call this template the for-catch loop, since the for-try-except-else-else clause doesn’t sound all that catchy.

Using the for-catch loop was a real game-changer for me, especially for avoiding temporary IP-blocks. Let’s go back to the Metacritic (MC) scraper I mentioned earlier. I collected the review overview pages of around 1000 movies (as an example, here’s the review page of Son of Saul). The aim was to scrape all the reviews (score + outlet) from the 1000 URLs. However, MC isn’t exactly keen on tolerating my scraper on its platform, even though I made sure to include plenty of pauses in my script to avoid sending a thousand requests in a minute or so. But alas, MC ruthlessly blocks your IP from time to time, and it’s impossible to predict when and why a block will be triggered. I tried to use some proxies, but to no avail. So I came up with the following for-catch loop to avoid breaking my script:

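(The gist is missing from this copy as well; the sketch below reconstructs it from the step-by-step walkthrough that follows. The user agent and the randomized sleep ranges are my assumptions around the described logic.)

import random
import time

import requests
from bs4 import BeautifulSoup

url = 'https://www.metacritic.com/movie/son-of-saul/critic-reviews'  # one of the ~1000 pages

for attempt in range(4):                    # a maximum of four tries
    try:
        time.sleep(random.uniform(10, 15))  # polite pause before each request
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        response.raise_for_status()         # an IP-block (e.g. 403/429) raises here
    except requests.exceptions.RequestException:
        print(f'Attempt {attempt + 1} failed; backing off for a few minutes...')
        time.sleep(random.uniform(240, 360))  # wait 4-6 minutes for the block to lift
        continue
    else:
        break                               # connection succeeded: exit the loop
else:
    # The loop ran its full course without a single successful connection
    raise Exception("Something really went wrong here... I'm sorry.")

soup = BeautifulSoup(response.text, 'html.parser')  # the end goal of this loop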

So this is what the for-catch loop does in this instance:

  1. Start the loop, iterating over range 0–3 (for).

  2. Sleep for 10–15 seconds and try to access the Metacritic URL (try).

  3. If this doesn’t work (except), display a warning message and let the script sleep somewhere between 4 and 6 minutes. After that, execute the try-part again. Do this until the loop is finished (from range 0–3; so a maximum of four tries or three retries, whatever you want to call it) or until the try-part works.

  4. If the connection is successful (else), break the loop. This means the code will execute the last line (i.e. creating a BeautifulSoup object, the end goal of this loop).

  5. If the script was able to complete the entire loop (which is bad news in this case), execute the else clause outside the loop and throw an exception (“Something really went wrong here… I’m sorry.”).

Since I scraped the reviews of around a thousand movies in total from an equal number of pages, I really needed a script that I could just execute and go about my day without worrying about the IP-blocks it would surely bump into. And this is what the for-catch loop afforded me: peace of mind. It worked its magic for my MC scraper: the IP-blocks were always lifted after a couple of minutes. For some other platforms I had to experiment and come up with more extreme sleep ranges (e.g. between 10 and 15 minutes), but the for-catch always did the trick.

When I showed this to a friend of mine, he was bewildered by the outer else-clause, which actually functions as an except here. The confusion is understandable, since we’re used to interpreting else clauses as part of a try-except or if-clause. However, the else-clause behaves just as it does in the aforementioned classic try-except-else-finally structure: the else-clause in that case is also executed when the try-part has run its course (and did not throw an error). Thus, the else-clause is always executed when a for-loop has exhausted its iterations; that’s it. In this case, we obviously hope we can break out of the loop before it ever finishes (i.e. before we use up all our retries).

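A tiny standalone demonstration of that rule:

for i in range(3):
    if i == 5:    # never true, so the loop is never broken out of
        break
else:
    print('The for-loop exhausted its iterations.')  # this prints

for i in range(3):
    if i == 1:    # true on the second iteration
        break
else:
    print('This never prints: the loop was cut short by break.')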

The for-catch loop has helped me in unexpected ways as well, often serving as a less time-wasting and less script-breaking alternative to the while-loop.

Take the following example from a script I wrote for switching between NordVPN servers on Linux or Windows (available on Github right here). Somewhere within the script, I wanted to:

  • Fetch and display the current IP

  • Connect to a new server

  • Fetch and display the new IP

Although this seems simple, the third part can be somewhat tricky for two reasons:

  1. Even after connecting to a new server, it can take a little while (especially on Windows) before you can successfully request your new IP. So if the NordVPN app is still busy switching servers, you’ll get a connection error. At the same time, though, you don’t want to perform the new IP-request too early either, since the odds are relatively high that you’re actually still browsing the web through your old NordVPN server. In that case, you’re just requesting the old IP.

    This means we’ll need a for-catch loop with an additional check on whether the requested IP differs from the previous one.

  2. Sometimes (again, only on Windows) the NordVPN app gets stuck and you need to reconnect to a different server.

    This means we’ll need another for-catch loop.

So essentially we end up with a for-catch loop within another for-catch loop (catch-ception!). The simplified version of the code I actually used looks something like this:

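(That snippet is also missing from this copy. Here’s a sketch following the description below; get_ip() and connect_to_new_server() are hypothetical stand-ins for the helpers in the actual GitHub script.)

import time

import requests

def get_ip():
    # Hypothetical helper: ask an external service for the current public IP.
    # While the VPN app is still switching, this raises a requests exception.
    return requests.get('https://api.ipify.org', timeout=10).text

def connect_to_new_server():
    # Hypothetical stand-in for the actual NordVPN switching logic
    ...

current_ip = get_ip()
print('Current IP:', current_ip)

new_ip = current_ip
for switch_attempt in range(5):        # outer for-catch: up to five reconnects
    connect_to_new_server()
    for attempt in range(12):          # inner for-catch: up to twelve IP requests
        try:
            new_ip = get_ip()
        except requests.RequestException:
            time.sleep(5)              # app still busy switching: wait and retry
            continue
        if new_ip != current_ip:
            break                      # success: we're browsing through the new server
        time.sleep(5)                  # same IP: probably still on the old server
    else:
        pass                           # a minute has passed without a new IP: let it be
    if new_ip != current_ip:           # an if-clause instead of an except
        break
    time.sleep(10)                     # app probably stuck: reconnect to another server
else:
    raise Exception('No new IP after five reconnects.')

print('New IP:', new_ip)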

This code snippet avoids an infinite loop by incorporating multiple retries, and at the same time avoids wasting time. In the inner loop, the script makes a total of 12 tries to fetch a new IP. The first one or two tries will inevitably be too early, resulting in the same IP (new_ip == current_ip), and the script will pause for 5 seconds before retrying. However, as soon as a new IP is successfully requested, the for-catch breaks. If there’s still no new IP after a minute (5 seconds * 12), the else clause will let it be (pass), but then the script gets caught up in another for-catch clause (although I opted for an if-clause instead of an except-catch). If the IP hasn’t changed, the script sleeps for 10 seconds and tries to connect to a new server again (for a maximum of 5 times).

I hope I’ve demonstrated the usefulness of the for-catch loop and why it is especially helpful for many web scraper applications. As a more flexible alternative to the while statement, it gives the script a finite number of retries, letting you pause the scraping process if necessary and do whatever you want once the script has exhausted its retries.

Happy scraping!

Source: https://medium.com/swlh/improve-your-web-scraper-with-limited-retry-loops-python-35e21730cbf5
