Efficient URL Caching for Web Crawlers (original English text)

Efficient URL Caching for World Wide Web Crawling
Andrei Z. Broder
IBM TJ Watson Research Center
19 Skyline Dr
Hawthorne, NY 10532
Marc Najork
Microsoft Research
1065 La Avenida
Mountain View, CA 94043
Janet L. Wiener
Hewlett Packard Labs
1501 Page Mill Road
Palo Alto, CA 94304
ABSTRACT
Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
    A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
1. INTRODUCTION
A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.
    Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or spidering. The basic algorithm is
    (a) Fetch a page
    (b) Parse it to extract all linked URLs
    (c) For all the URLs not seen before, repeat (a)–(c)
    Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.
    If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes. (See for instance [34]).
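    The basic algorithm, read as a breadth-first graph traversal, can be sketched as follows. This is a minimal illustration, not the crawler used in this paper: the function name `crawl`, the callback `fetch_links`, and the toy link graph `TOY_WEB` are all hypothetical, and a real crawler would fetch pages over HTTP rather than look them up in a dictionary.

```python
from collections import deque


def crawl(seeds, fetch_links):
    """Breadth-first traversal: fetch a page, extract links, enqueue unseen URLs."""
    seen = set(seeds)          # every URL discovered so far (the membership-test set)
    frontier = deque(seeds)    # URLs waiting to be fetched; FIFO order gives BFS
    order = []                 # order in which pages were fetched
    while frontier:
        url = frontier.popleft()           # (a) fetch a page
        order.append(url)
        for link in fetch_links(url):      # (b) extract all linked URLs
            if link not in seen:           # (c) membership test
                seen.add(link)
                frontier.append(link)
    return order


# Toy link graph standing in for the web (hypothetical data).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))
```

    Replacing the `deque` with a stack would turn the same loop into a depth-first traversal; the `seen` set is the structure whose size motivates the rest of the paper.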
    However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.
    1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.
    2. Web pages are changing rapidly. If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].
    These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
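    The arithmetic behind these rates can be checked in a few lines. The figure of 10 links per page is an illustrative assumption chosen only to show how the "ten thousand" order of magnitude arises; it is not a number taken from the text at this point.

```python
PAGES_PER_DAY = 100_000_000          # crawl rate stated in the text
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400

fetches_per_second = PAGES_PER_DAY / SECONDS_PER_DAY

# Assumed for illustration: each fetched page yields about 10 links,
# and every link triggers one membership test.
ASSUMED_LINKS_PER_PAGE = 10
tests_per_second = fetches_per_second * ASSUMED_LINKS_PER_PAGE

print(f"fetches per second: {fetches_per_second:.0f}")
print(f"membership tests per second: {tests_per_second:.0f}")
```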
    A crucial way to speed up the membership test is to cache a (dynamic) subset of the “seen” URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.
    The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.
2. CRAWLING
Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.
    The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.
    The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system.
    For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component. A crawler that discovers a URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.
    Cho and Garcia-Molina’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning a URL to a C-proc based on a hash of the entire URL), site-based (assigning a URL to a C-proc based on a hash of the URL’s host part), and hierarchical (assigning a URL to a C-proc based on some property of the URL, such as its top-level domain).
    The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers a URL for which it is not responsible sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.
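    The three partitioning schemes can be illustrated with a short sketch. The function names, the use of SHA-1 as the hash, and the `tld_map` table are assumptions made for this example; the cited work does not prescribe them.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(text, num_procs):
    """Map a string to a process number with a stable hash (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_procs


def url_based(url, num_procs):
    """URL-based: hash the entire URL."""
    return _bucket(url, num_procs)


def site_based(url, num_procs):
    """Site-based: hash only the host part, so one host maps to one process."""
    return _bucket(urlsplit(url).hostname or "", num_procs)


def hierarchical(url, tld_map, default=0):
    """Hierarchical: assign by a property of the URL, here its top-level domain."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return tld_map.get(tld, default)


a = "http://www.example.org/index.html"
b = "http://www.example.org/papers/list.html"
print(site_based(a, 4) == site_based(b, 4))       # same host, same process
print(hierarchical(a, {"org": 2, "com": 1}))
```

    With the site-based scheme, relative links stay on the process that found them, which is why most extracted URLs need no cross-process transfer.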
    UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.
    Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler components are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.
    Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and caching popular URLs in memory is a win: Caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.
    Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler [4], and Cho and Garcia-Molina’s crawler [13], are composed of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends each link to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.
3. CACHING
In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37]. In the web context, caching is often mentioned in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.
    Caching terminology is as follows: the cache is memory used to store equal sized atomic items. A cache has size k if it can store at most k items. At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.
    Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:
    • Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to yahoo.com is a much more common occurrence than a link to the authors’ home pages.
    • Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page – we found that on average about 30% are duplicates, cf. Section 4.2, and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, well-known papers, etc.
    Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways.
    We now describe some standard caching algorithms, whose performance we evaluate in Section 5.
3.1 Infinite cache (INFINITE)
    This is a theoretical algorithm that assumes that the size of the cache is larger than the number of distinct requests.
3.2 Clairvoyant caching (MIN)
    More than 35 years ago, László Belady [2] showed that if the entire sequence of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance.
3.3 Least recently used (LRU)
    The LRU algorithm evicts the item in the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefore the number of misses will be minimized in the spirit of Belady’s algorithm.
    Despite the admonition that “past performance is no guarantee of future results”, sadly verified by the current state of the stock markets, in practice, LRU is generally very effective. However, it requires maintaining a priority queue of requests. This queue has a processing time cost and a memory cost. The latter is usually ignored in caching situations where the items are large.
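    A compact way to experiment with LRU is sketched below. It swaps the priority queue mentioned above for Python's `collections.OrderedDict`, which keeps items in recency order; the class name `LRUCache` and the helper `miss_ratio` are hypothetical names introduced for this example.

```python
from collections import OrderedDict


class LRUCache:
    """LRU cache of at most k items, kept in least-recently-used-first order."""

    def __init__(self, k):
        self.k = k
        self.items = OrderedDict()   # keys are cached items; values are unused

    def request(self, x):
        """Return True on a hit and False on a miss."""
        if x in self.items:
            self.items.move_to_end(x)        # x is now the most recently used
            return True
        if len(self.items) >= self.k:
            self.items.popitem(last=False)   # evict the least recently used item
        self.items[x] = None
        return False


def miss_ratio(cache, trace):
    """Fraction of requests in the trace that miss in the given cache."""
    misses = sum(not cache.request(x) for x in trace)
    return misses / len(trace)


print(miss_ratio(LRUCache(3), list("abacba")))
```

    The same `miss_ratio` helper works with any object exposing a `request` method, so other policies can be compared on the same trace.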
3.4 CLOCK
    CLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits $M_0, M_1, \ldots, M_k$ corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arrives, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves sequentially through the array, turning the mark bits off, until an unmarked location is found. The cache item corresponding to the unmarked location is evicted and replaced by X.
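    The CLOCK description translates almost directly into code. The sketch below is one possible reading: the class name `ClockCache` is hypothetical, empty slots are represented by `None` (so `None` itself cannot be cached), a newly inserted item starts with its mark bit off, and the handle advances past the slot it has just filled. The last two points are assumptions, since the description leaves them open.

```python
class ClockCache:
    """CLOCK replacement over k slots arranged in a circle."""

    def __init__(self, k):
        self.k = k
        self.slots = [None] * k     # cached items; None marks an empty slot
        self.marks = [False] * k    # one mark bit per slot
        self.index = {}             # item -> slot number, for O(1) lookup
        self.hand = 0               # position of the clock handle

    def request(self, x):
        """Return True on a hit and False on a miss."""
        slot = self.index.get(x)
        if slot is not None:
            self.marks[slot] = True            # hit: turn the mark bit on
            return True
        # Miss: sweep forward, clearing mark bits, until an unmarked slot is found.
        while self.marks[self.hand]:
            self.marks[self.hand] = False
            self.hand = (self.hand + 1) % self.k
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim]             # evict the item in the unmarked slot
        self.slots[self.hand] = x
        self.index[x] = self.hand
        self.hand = (self.hand + 1) % self.k   # assumption: move past the new item
        return False


cache = ClockCache(2)
print([cache.request(x) for x in "abacb"])
```

    The sweep always terminates, because it clears every mark bit it passes and so finds an unmarked slot after at most one full turn.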
3.5 Random replacement (RANDOM)
    Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced.
    In most practical situations, random replacement performs worse than CLOCK but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Section 6.
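    Random replacement needs only a list of cached items and a lookup table. The sketch below is not the space-free variant referred to above; it is a plain version with a hypothetical class name `RandomCache` and a seeded generator so that runs are repeatable.

```python
import random


class RandomCache:
    """Random replacement: on a miss with a full cache, overwrite a random slot."""

    def __init__(self, k, seed=0):
        self.k = k
        self.items = []                  # cached items, in arbitrary order
        self.index = {}                  # item -> position in self.items
        self.rng = random.Random(seed)   # seeded for reproducible experiments

    def request(self, x):
        """Return True on a hit and False on a miss."""
        if x in self.index:
            return True
        if len(self.items) < self.k:
            self.index[x] = len(self.items)   # cache not yet full: append
            self.items.append(x)
        else:
            pos = self.rng.randrange(self.k)  # pick a victim uniformly at random
            del self.index[self.items[pos]]
            self.items[pos] = x
            self.index[x] = pos
        return False


cache = RandomCache(2)
for url in ["a", "b", "c"]:
    cache.request(url)
print(sorted(cache.items))
```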
3.6 Static caching (STATIC)
    If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.
    There are two issues with this approach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in most practical situations.
    In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the current crawl. (There are also efficient techniques for discovering the most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.) Of course, for simulation purposes we can do a first pass over our input to determine the k most popular URLs, and then preload the cache with these URLs, which is what we did in our experiments.
    The second issue above is the very reason we decided to test STATIC: if STATIC performs well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests are highly correlated.
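    For simulation purposes, the two-pass STATIC procedure can be sketched as follows. The function name `static_miss_ratio` is hypothetical, ties between equally popular items are broken arbitrarily, and the initial loading of the cache is not charged as misses.

```python
from collections import Counter


def static_miss_ratio(trace, k):
    """Preload the k most frequent items of the trace, then count misses."""
    # First pass: find the k most popular items.
    popular = {item for item, _count in Counter(trace).most_common(k)}
    # Second pass: the cache never changes, so a miss is any request outside it.
    misses = sum(x not in popular for x in trace)
    return misses / len(trace)


trace = list("aaabbc")
print(static_miss_ratio(trace, 1))   # only "a" is cached
print(static_miss_ratio(trace, 2))   # "a" and "b" are cached
```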
4. EXPERIMENTAL SETUP
    We now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl.
4.1 Mercator crawler architecture
    A Mercator crawling system consists of a number of crawling processes, usually running on separate machines. Each crawling process is responsible for a subset of all web servers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.
    Each worker thread repeatedly performs the following operations: it obtains a URL from the URL Frontier, which is a disk-based data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream or RIS for short); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic properties. For example, it discards all URLs belonging to web servers that contacted us and asked not to be crawled.
    The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL’s host name. Since most links are relative, most of the URLs (81.5% in our experiment) will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling processes.
    Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URL Frontier for future download. In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today’s web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be configured to maintain this set as a distributed in-memory hash table (where each crawling process maintains the subset of URLs assigned to it); however, this DUE implementation (which reduces URLs to 8-byte checksums, and uses the first 3 bytes of the checksum to index into the hash table) requires about 5.2 bytes per URL, meaning that it takes over 5 GB of RAM per crawling machine to maintain a set of 1 billion URLs per machine. These memory requirements are too steep in many settings, and in fact, they exceeded the hardware available to us for this experiment. Therefore, we used an alternative DUE implementation that buffers incoming URLs in memory, but keeps the bulk of URLs (or rather, their 8-byte checksums) in sorted order on disk. Whenever the in-memory buffer fills up, it is merged into the disk file (which is a very expensive operation due to disk latency) and newly discovered URLs are passed on to the Frontier.
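    The buffer-and-merge behaviour of the disk-based DUE can be imitated in memory. In the sketch below a sorted Python list stands in for the sorted disk file; the class name `BufferedDUE` and its methods are hypothetical and say nothing about how Mercator is actually coded.

```python
from bisect import bisect_left
from heapq import merge


def _contains(sorted_list, x):
    """Binary search for x in a sorted list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


class BufferedDUE:
    """Buffers checksums in memory and merges them into a sorted 'disk file'."""

    def __init__(self, buffer_limit):
        self.buffer_limit = buffer_limit
        self.buffer = set()    # in-memory buffer of incoming checksums
        self.on_disk = []      # sorted list standing in for the disk file
        self.merges = 0        # how many (expensive) merges have run

    def add(self, checksum):
        """Buffer one checksum; return newly discovered ones if a merge ran."""
        self.buffer.add(checksum)
        if len(self.buffer) >= self.buffer_limit:
            return self.flush()
        return []

    def flush(self):
        """Merge the buffer into the sorted file and report what was new."""
        batch = sorted(self.buffer)
        new = [c for c in batch if not _contains(self.on_disk, c)]
        self.on_disk = list(merge(self.on_disk, new))   # both inputs are sorted
        self.buffer.clear()
        self.merges += 1
        return new   # these would be passed on to the Frontier


due = BufferedDUE(buffer_limit=3)
for checksum in [5, 1, 5, 9, 1, 2, 9]:
    fresh = due.add(checksum)
    if fresh:
        print("newly discovered:", fresh)
```

    Every duplicate that reaches the buffer brings the next merge closer without contributing anything new, which is the cost a cache in front of the buffer avoids.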
    Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it possible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead of sending them to the peer node, thereby reducing the amount of network traffic. This reduction is particularly important in a scenario where the individual crawling machines are not connected via a high-speed LAN (as they were in our experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers “close to it”.
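    A Host Splitter with a cache in front of it might look like the sketch below. The class name `HostSplitter`, the SHA-1 host hash, the small LRU cache, and the in-memory `outbox` lists are assumptions made for illustration; a real crawler would batch the outbox contents over TCP.

```python
import hashlib
from collections import OrderedDict
from urllib.parse import urlsplit


class HostSplitter:
    """Routes URLs to crawlers by host hash, dropping URLs that hit in a cache."""

    def __init__(self, my_id, num_crawlers, cache_size):
        self.my_id = my_id
        self.num_crawlers = num_crawlers
        self.cache_size = cache_size
        self.cache = OrderedDict()   # small LRU cache of recently seen URLs
        self.local = []              # URLs handled by this crawler
        self.outbox = {p: [] for p in range(num_crawlers) if p != my_id}

    def owner(self, url):
        """Crawler responsible for a URL, chosen by a hash of its host name."""
        host = urlsplit(url).hostname or ""
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_crawlers

    def route(self, url):
        """Return 'dropped' on a cache hit, otherwise 'local' or 'peer'."""
        if url in self.cache:
            self.cache.move_to_end(url)
            return "dropped"                     # duplicate: nothing is sent
        self.cache[url] = None
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)       # evict the least recently used URL
        dest = self.owner(url)
        if dest == self.my_id:
            self.local.append(url)
            return "local"
        self.outbox[dest].append(url)            # would be batched and sent via TCP
        return "peer"


splitter = HostSplitter(my_id=0, num_crawlers=4, cache_size=1000)
print(splitter.route("http://www.example.org/a.html"))
print(splitter.route("http://www.example.org/a.html"))
```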
    Mercator performs an approximation of a breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator’s politeness policy, which limits the load placed by the crawler on any particular web server. Mercator’s politeness policy guarantees that no server ever receives multiple requests from Mercator in parallel; in addition, it guarantees that the next request to a server will only be issued after a multiple (typically 10×) of the time it took to answer the previous request has passed. Such a politeness policy is essential to any large-scale web crawler; otherwise the crawler’s operator becomes inundated with complaints.
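    The two guarantees of the politeness policy can be expressed as a small bookkeeping class. The name `PolitenessPolicy` and its methods are hypothetical, and the sketch assumes the waiting period is measured from the moment the previous response finished, which the description does not spell out.

```python
class PolitenessPolicy:
    """At most one request per host at a time, plus a wait of factor * duration."""

    def __init__(self, factor=10.0):
        self.factor = factor
        self.busy = set()      # hosts with a request currently in flight
        self.ready_at = {}     # host -> earliest time the next request may start

    def may_fetch(self, host, now):
        """True if a new request to this host is allowed at time `now`."""
        return host not in self.busy and now >= self.ready_at.get(host, 0.0)

    def started(self, host):
        """Record that a request to this host is now in flight."""
        self.busy.add(host)

    def finished(self, host, start, end):
        """Record completion; the host rests for factor * (end - start)."""
        self.busy.discard(host)
        self.ready_at[host] = end + self.factor * (end - start)


policy = PolitenessPolicy()
policy.started("www.example.org")
policy.finished("www.example.org", start=0.0, end=2.0)   # the request took 2 s
print(policy.may_fetch("www.example.org", now=21.9))
print(policy.may_fetch("www.example.org", now=22.0))
```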
4.2 Our web crawl
    Our crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM, 144 GB of disk, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet’s backbone.
    The crawl ran from July 12 until September 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links. Earlier studies reported only an average of 8 links [9] or 17 links per page [33]. We offer three explanations as to why we found more links per page. First, we configured Mercator to not limit itself to URLs found in anchor tags, but rather to extract URLs from all tags that may contain them (e.g. image tags). This configuration increases both the mean and the median number of links per page. Second, we configured it to download pages up to 16 MB in size (a setting that is significantly higher than usual), making it possible to encounter pages with tens of thousands of links. Third, most studies report the number of unique links per page. The numbers above include duplicate copies of a link on a page. If we only consider unique links per page, then the average number of links is 42.74 and the median is 17.
    The links extracted from these HTML pages, plus about 38 million HTTP redirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the effectiveness of various caching algorithms, we instrumented Mercator’s Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs.
    After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique fingerprints, which we determined using an in-memory hash table on a very-large-memory machine. This data reduction step left us with four condensed host splitter logs (one per crawling machine), ranging from 51 GB to 57 GB in size and containing between 6.4 and 7.1 billion URLs.
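    The condensing step and the collision check can be sketched as follows. The fingerprinting scheme cited in the text is not reproduced here; a BLAKE2b hash truncated to 8 bytes, from Python's `hashlib`, is swapped in purely for illustration, and the function names are hypothetical.

```python
import hashlib


def fingerprint64(url):
    """64-bit fingerprint of a URL (a truncated BLAKE2b hash, used as a stand-in)."""
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def has_collisions(urls):
    """Compare the number of unique URLs with the number of unique fingerprints."""
    unique_urls = set(urls)
    unique_fingerprints = {fingerprint64(u) for u in unique_urls}
    return len(unique_fingerprints) != len(unique_urls)


log = [
    "http://www.example.org/",
    "http://www.example.org/a.html",
    "http://www.example.org/",        # duplicate entry in the log
]
print([fingerprint64(u) for u in log])
print(has_collisions(log))
```

    At 8 bytes per entry, a fingerprint log is far smaller than the raw URL log, which is what makes it practical to replay the trace many times in simulation.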
    In order to explore the effectiveness of caching with respect to inter-process communication in a distributed crawler, we also extracted a sub-trace of the Host Splitter logs that contained only those URLs that were sent to peer crawlers. These logs contained 4.92 billion URLs, or about 19.5% of all URLs. We condensed the sub-trace logs in the same fashion. We then used the condensed logs for our simulations.
5. SIMULATION RESULTS
    We studied the effects of caching with respect to two streams of URLs:
    1. A trace of all URLs extracted from the pages assigned to a particular machine. We refer to this as the full trace.
    2. A trace of all URLs extracted from the pages assigned to a particular machine that were sent to one of the other machines for processing. We refer to this trace as the cross subtrace, since it is a subset of the full trace.
    The reason for exploring both these choices is that, depending on other architectural decisions, it might make sense to cache only the URLs to be sent to other machines or to use a separate cache just for this purpose.
    We fed each trace into implementations of each of the caching algorithms described above, configured with a wide range of cache sizes. We performed about 1,800 such experiments. We first describe the algorithm implementations, and then present our simulation results.
5.1 Algorithm implementations
    The implementation of each algorithm is straightforward. We use a hash table to find each item in the cache. We also keep a separate data structure of the cache items, so that we can choose one for eviction. For RANDOM, this data structure is simply a list. For CLOCK, it is a list and a clock handle, and the items also contain “mark” bits. For LRU, it is a heap, organized by last access time. STATIC needs no extra data structure, since it never evicts items. MIN is more complicated since for each item in the cache, MIN needs to know when the next request for that item will be. We therefore describe MIN in more detail. Let $A$ be the trace or sequence of requests, that is, $A_t$ is the item requested at time $t$. We create a second sequence $N_t$ containing the time when $A_t$ next appears in $A$. If there is no further request for $A_t$ after time $t$, we set $N_t = \infty$. Formally,
    $$N_t = \begin{cases} \min\{\, t' > t : A_{t'} = A_t \,\} & \text{if such a } t' \text{ exists,} \\ \infty & \text{otherwise.} \end{cases}$$
    To generate the sequence $N_t$, we read the trace $A$ backwards, that is, from $t_{\max}$ down to 0, and use a hash table with key $A_t$ and value $t$. For each item $A_t$, we probe the hash table. If it is not found, we set $N_t = \infty$ and store $(A_t, t)$ in the table. If it is found, we retrieve $(A_t, t')$, set $N_t = t'$, and replace $(A_t, t')$ by $(A_t, t)$ in the hash table. Given $N_t$, implementing MIN is easy: we read $A_t$ and $N_t$ in parallel, and hence for each item requested, we know when it will be requested next. We tag each item in the cache with the time when it will be requested next, and if necessary, evict the item with the highest value for its next request, using a heap to identify it quickly.
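    The backward pass and the heap-based eviction can be sketched as follows. The function names are hypothetical. Two choices in the sketch are assumptions rather than details given above: stale heap entries are skipped lazily instead of being removed, and the incoming item is itself allowed to be the victim, which corresponds to choosing not to add a missed item.

```python
import heapq
import math


def next_request_times(trace):
    """For each position t, the time of the next request for trace[t], or infinity."""
    last_seen = {}                        # item -> most recent position seen so far
    nxt = [math.inf] * len(trace)
    for t in range(len(trace) - 1, -1, -1):    # read the trace backwards
        item = trace[t]
        nxt[t] = last_seen.get(item, math.inf)
        last_seen[item] = t
    return nxt


def min_misses(trace, k):
    """Number of misses of the clairvoyant policy with a cache of size k."""
    nxt = next_request_times(trace)
    cached = {}     # item -> time of its next request
    heap = []       # (-next request time, item); may contain stale entries
    misses = 0
    for t, item in enumerate(trace):
        if item not in cached:
            misses += 1
        cached[item] = nxt[t]                       # tag with the next request time
        heapq.heappush(heap, (-nxt[t], item))
        while len(cached) > k:
            neg_time, victim = heapq.heappop(heap)  # farthest next request first
            if cached.get(victim) == -neg_time:     # skip stale heap entries
                del cached[victim]
    return misses


print(next_request_times(list("abab")))
print(min_misses(list("abcab"), 2))
```

    Items in the trace must be mutually comparable (for example, all strings or all integers), because the heap breaks ties between equal times by comparing the items.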
5.2 Results
    We present the results for only one crawling host. The results for the other three hosts are quasi-identical. Figure 2 shows the miss rate over the entire trace (that is, the percentage of misses out of all requests to the cache) as a function of the size of the cache. We look at cache sizes from $k = 2^0$ to $k = 2^{25}$. In Figure 3 we present the same data relative to the miss-rate of MIN, the optimum off-line algorithm. The same simulations for the cross-trace are depicted in Figures 4 and 5.
    For both traces, LRU and CLOCK perform almost identically and only slightly worse than the ideal MIN, except in the critical region discussed below. RANDOM is only slightly inferior to CLOCK and LRU, while STATIC is generally much worse. Therefore, we conclude that there is considerable locality of reference in the trace, as explained in Section 3.6. For very large caches, STATIC appears to do better than MIN. However, this is just an artifact of our accounting scheme: we only charge for misses and STATIC is not charged for the initial loading of the cache. If STATIC were instead charged k misses for the initial loading of its cache, then its miss rate would be of course worse than MIN’s.
7. CONCLUSIONS AND FUTURE DIRECTIONS
    After running about 1,800 simulations over a trace containing 26.86 billion URLs, our main conclusion is that URL caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this size is a critical point, that is, a substantially smaller cache is ineffectual while a substantially larger cache brings little additional benefit. For practical purposes our investigation is complete: In view of our discussion in Section 5.2, we recommend a cache size of between 100 and 500 entries per crawling thread. All caching strategies perform roughly the same; we recommend using either CLOCK or RANDOM, implemented using a scatter table with circular chains. Thus, for 500 crawling threads, this cache will be about 2 MB – completely insignificant compared to other data structures needed in a crawler. If the intent is only to reduce cross machine traffic in a distributed crawler, then a slightly smaller cache could be used. In either case, the goal should be to have a miss rate lower than 20%.
    However, there are some open questions, worthy of further research. The first open problem is to what extent the crawl order strategy (graph traversal method) affects the caching performance. Various strategies have been proposed [14], but there are indications [30] that after a short period from the beginning of the crawl the general strategy does not matter much. Hence, we believe that caching performance will be very similar for any alternative crawling strategy. We can try to implement other strategies ourselves, but ideally we would use independent crawls. Unfortunately, crawling on web scale is not a simple endeavor, and it is unlikely that we can obtain crawl logs from commercial search engines.
    In view of the observed fact that the size of the cache needed to achieve top performance depends on the number of threads, the second question is whether having a per-thread cache makes sense. In general, but not always, a global cache performs better than a collection of separate caches, because common items need to be stored only once. However, this assertion needs to be verified in the URL caching context.
    The third open question concerns the explanation we propose in Section 5 regarding the scope of the links encountered on a given host. If our model is correct then it has certain implications regarding the appropriate model for the web graph, a topic of considerable interest among a wide variety of scientists: mathematicians, physicists, and computer scientists. We hope that our paper will stimulate research to estimate the cache performance under various models. Models where caching performs well due to correlation of links on a given host are probably closer to reality. We are making our URL traces available for this research by donating them to the Internet Archive.
