woshizn

基于网络爬虫的有效URL缓存(英文原文）

Efficient URL Caching for World Wide Web Crawling
Andrei Z. Broder
IBM TJ Watson Research Center
19 Skyline Dr
Hawthorne, NY 10532
[email protected]
Marc Najork
Microsoft Research
1065 La Avenida
Mountain View, CA 94043
[email protected]
Janet L. Wiener
Hewlett Packard Labs
1501 Page Mill Road
Palo Alto, CA 94304
[email protected]
ABSTRACT
Crawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
1. INTRODUCTION
A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.
Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from nonweb sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is
(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)–(c)
Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or made up of URLs collected during previous crawls. Sometimes crawls are started from a single well connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.
If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes. (See for instance [34]).
However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.
1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.
2. Web pages are changing rapidly. If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].
These two factors imply that to obtain a reasonably fresh and 679 complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.
A crucial way to speed up the membership test is to cache a (dynamic) subset of the “seen” URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries and show how these caches can be implemented very efficiently.
The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.
2. CRAWLING
Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.
The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.
The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL Server. The various processes communicate via the file system.
For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.
Cho and Garcia-Molina’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL’s host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain).
The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers an URL for which it is not responsible, sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.
UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.
Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler components are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.
Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk and caching popular URLs in memory is a win: Caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.
Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler[4], and Cho and Molina’s crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.
3. CACHING
In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37]. In the web context, caching is often mentioned
in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk, and cache a small portion in main memory.
Caching terminology is as follows: the cache is memory used to store equal sized atomic items. A cache has size k if it can store at most k items.1 At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.
Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given size cache. In general, caching is successful for two reasons:
_ Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to yahoo.com is a much more common occurrence than a link to the authors’ home pages.
_ Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page – we found that on average about 30% are duplicates, cf. Section 4.2, and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, notorious papers, etc.
Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache. Caching algorithms try to capture this intuition in various ways.
We now describe some standard caching algorithms, whose performance we evaluate in Section 5.
3.1 Infinite cache (INFINITE)
This is a theoretical algorithm that assumes that the size of the cache is larger than the number of distinct requests.
3.2 Clairvoyant caching (MIN)
More than 35 years ago, L´aszl´o Belady [2] showed that if the entire sequence of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance.
3.3 Least recently used (LRU)
The LRU algorithm evicts the item in the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefore the number of misses will be minimized in the spirit of Belady’s algorithm.
Despite the admonition that “past performance is no guarantee of future results”, sadly verified by the current state of the stock markets, in practice, LRU is generally very effective. However, it requires maintaining a priority queue of requests. This queue has a processing time cost and a memory cost. The latter is usually ignored in caching situations where the items are large.
3.4 CLOCK
CLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits M0;M1; : : : ;Mk corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arrives, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves sequentially through the array, turning the mark bits off, until an unmarked location is found. The cache item corresponding to the unmarked location is evicted and replaced by X.
3.5 Random replacement (RANDOM)
Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced.
In most practical situations, random replacement performs worse than CLOCK but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Section 6.
3.6 Static caching (STATIC)
If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.
There are two issues with this approach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in most practical situations.
In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the current crawl. (There are also efficient techniques for discovering the most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.) Of course, for simulation purposes we can do a first pass over our input to determine the k most popular URLs, and then preload the cache with these URLs, which is what we did in our experiments.
The second issue above is the very reason we decided to test STATIC: if STATIC performs well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests are highly correlated.
4. EXPERIMENTAL SETUP
We now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl.
4.1 Mercator crawler architecture
A Mercator crawling system consists of a number of crawling processes, usually
running on separate machines. Each crawling process is responsible for a subset of all web servers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.
Each worker thread repeatedly performs the following operations: it obtains a URL from the URL Frontier, which is a diskbased data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream or RIS for short); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic properties. For example, it discards all URLs belonging to web servers that contacted us and asked not be crawled.
The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL’s host name. Since most links are relative, most of the URLs (81.5% in our experiment) will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling processes. Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URL Frontier for future download. In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today’s web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be configured to maintain this set as a distributed in-memory hash table (where each crawling process maintains the subset of URLs assigned to it); however, this DUE implementation (which reduces URLs to 8-byte checksums, and uses the first 3 bytes of the checksum to index into the hash table) requires about 5.2 bytes per URL, meaning that it takes over 5 GB of RAM per crawling machine to maintain a set of 1 billion URLs per machine. These memory requirements are too steep in many settings, and in fact, they exceeded the hardware available to us for this experiment. Therefore, we used an alternative DUE implementation that buffers incoming URLs in memory, but keeps the bulk of URLs (or rather, their 8-byte checksums) in sorted order on disk. Whenever the in-memory buffer fills up, it is merged into the disk file (which is a very expensive operation due to disk latency) and newly discovered URLs are passed on to the Frontier.
Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it possible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead of sending them to the peer node, thereby reducing the amount of network traffic. This reduction is particularly important in a scenario where the individual crawling machines are not connected via a high-speed LAN (as they were in our experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers “close to it”.
Mercator performs an approximation of a breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator’s politeness policy, which limits the load placed by the crawler on any particular web server. Mercator’s politeness policy guarantees that no server ever receives multiple requests from Mercator in parallel; in addition, it guarantees that the next request to a server will only be issued after a multiple (typically 10_) of the time it took to answer the previous request has passed. Such a politeness policy is essential to any large-scale web crawler; otherwise the crawler’s operator becomes inundated with complaints.
4.2 Our web crawl
Our crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM, 144 GB of disk2, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet’s backbone.
The crawl ran from July 12 until September 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links. Earlier studies reported only an average of 8 links [9] or 17 links per page [33]. We offer three explanations as to why we found more links per page. First, we configured Mercator to not limit itself to URLs found in anchor tags, but rather to extract URLs from all tags that may contain them (e.g. image tags). This configuration increases both the mean and the median number of links per page. Second, we configured it to download pages up to 16 MB in size (a setting that is significantly higher than usual), making it possible to encounter pages with tens of thousands of links. Third, most studies report the number of unique links per page. The numbers above include duplicate copies of a link on a page. If we only consider unique links3 per page, then the average number of links is 42.74 and the median is 17.
The links extracted from these HTML pages, plus about 38 million HTTP redirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the effectiveness of various caching algorithms, we instrumented Mercator’s Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs.
After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique fingerprints, which we determined using an in-memory hash table on a very-large-memory machine. This data reduction step left us with four condensed host splitter logs (one per crawling machine), ranging from 51 GB to 57 GB in size and containing between 6.4 and 7.1 billion URLs.
In order to explore the effectiveness of caching with respect to inter-process communication in a distributed crawler, we also extracted a sub-trace of the Host Splitter logs that contained only those URLs that were sent to peer crawlers. These logs contained 4.92 billion URLs, or about 19.5% of all URLs. We condensed the sub-trace logs in the same fashion. We then used the condensed logs for our simulations.
5. SIMULATION RESULTS
We studied the effects of caching with respect to two streams of URLs:
1. A trace of all URLs extracted from the pages assigned to a particular machine. We refer to this as the full trace.
2. A trace of all URLs extracted from the pages assigned to a particular machine that were sent to one of the other machines for processing. We refer to this trace as the cross subtrace, since it is a subset of the full trace.
The reason for exploring both these choices is that, depending on other architectural decisions, it might make sense to cache only the URLs to be sent to other machines or to use a separate cache just for this purpose.
We fed each trace into implementations of each of the caching algorithms described above, configured with a wide range of cache sizes. We performed about 1,800 such experiments. We first describe the algorithm implementations, and then present our simulation results.
5.1 Algorithm implementations
The implementation of each algorithm is straightforward. We use a hash table to find each item in the cache. We also keep a separate data structure of the cache items, so that we can choose one for eviction. For RANDOM, this data structure is simply a list. For CLOCK, it is a list and a clock handle, and the items also contain “mark” bits. For LRU, it is a heap, organized by last access time. STATIC needs no extra data structure, since it never evicts items. MIN is more complicated since for each item in the cache, MIN needs to know when the next request for that item will be. We therefore describe MIN in more detail. Let A be the trace or sequence of requests, that is, At is the item requested at time t. We create a second sequence Nt containing the time when At next appears in A. If there is no further request for At after time t, we set Nt = 1. Formally,
To generate the sequence Nt, we read the trace A backwards, that is, from tmax down to 0, and use a hash table with key At and value t. For each item At, we probe the hash table. If it is not found, we set Nt = 1and store (At; t) in the table. If it is found, we retrieve (At; t0), set Nt = t0, and replace (At; t0) by (At; t) in the hash table. Given Nt, implementing MIN is easy: we read At and Nt in parallel, and hence for each item requested, we know when it will be requested next. We tag each item in the cache with the time when it will be requested next, and if necessary, evict the item with the highest value for its next request, using a heap to identify itquickly.
5.2 Results
We present the results for only one crawling host. The results for the other three hosts are quasi-identical. Figure 2 shows the miss rate over the entire trace (that is, the percentage of misses out of all requests to the cache) as a function of the size of the cache. We look at cache sizes from k = 20 to k = 225. In Figure 3 we present the same data relative to the miss-rate of MIN, the optimum off-line algorithm. The same simulations for the cross-trace are depicted in Figures 4 and 5.
For both traces, LRU and CLOCK perform almost identically and only slightly worse than the ideal MIN, except in the critical region discussed below. RANDOM is only slightly inferior to CLOCK and LRU, while STATIC is generally much worse. Therefore, we conclude that there is considerable locality of reference in the trace, as explained in Section 3.6. For very large caches, STATIC appears to do better than MIN. However, this is just an artifact of our accounting scheme: we only charge for misses and STATIC is not charged for the initial loading of the cache. If STATIC were instead charged k misses for the initial loading of its cache, then its miss rate would be of course worse than MIN’s.
6. CONCLUSIONS AND FUTURE DIRECTIONS
After running about 1,800 simulations over a trace containing 26.86 billion URLs, our main conclusion is that URL caching is very effective – in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this size is a critical point, that is, a substantially smaller cache is ineffectual while a substantially larger cache brings little additional benefit. For practical purposes our investigation is complete: In view of our discussion in Section 5.2, we recommend a cache size of between 100 to 500 entries per crawling thread. All caching strategies perform roughly the same; we recommend using either CLOCK or RANDOM, implemented using a scatter table with circular chains. Thus, for 500 crawling threads, this cache will be about 2MB – completely insignificant compared to other data structures needed in a crawler. If the intent is only to reduce cross machine traffic in a distributed crawler, then a slightly smaller cache could be used. In either case, the goal should be to have a miss rate lower than 20%.
However, there are some open questions, worthy of further research. The first open problem is to what extent the crawl order strategy (graph traversal method) affects the caching performance. Various strategies have been proposed [14], but there are indications [30] that after a short period from the beginning of the crawl the general strategy does not matter much. Hence, we believe that caching performance will be very similar for any alternative crawling strategy. We can try to implement other strategies ourselves, but ideally we would use independent crawls. Unfortunately, crawling on web scale is not a simple endeavor, and it is unlikely that we can obtain crawl logs from commercial search engines.
In view of the observed fact that the size of the cache needed to achieve top performance depends on the number of threads, the second question is whether having a per-thread cache makes sense. In general, but not always, a global cache performs better than a collection of separate caches, because common items need to be stored only once. However, this assertion needs to be verified in the URL caching context.
The third open question concerns the explanation we propose in Section 5 regarding the scope of the links encountered on a given host. If our model is correct then it has certain implications regarding the appropriate model for the web graph, a topic of considerable interest among a wide variety of scientists: mathematicians, physicists, and computer scientists. We hope that our paper will stimulate research to estimate the cache performance under various models. Models where caching performs well due to correlation of links on a given host are probably closer to reality. We are making our URL traces available for this research by donating them to the Internet Archive.

你可能感兴趣的:(Web,cache,ant,UP,performance)

什么是三高架构? java1234_小锋 java 架构 java 微服务
大家好，我是锋哥。今天分享关于【什么是三高架构?】面试题。希望对大家有帮助；什么是三高架构?1000道互联网大厂Java工程师精选面试题-Java资源分享网“三高架构”通常是指高可用性（HighAvailability）、高性能（HighPerformance）和高扩展性（HighScalability）架构。这三个特性是现代计算系统、尤其是在分布式系统和云计算架构中，设计和部署的关键目标。以下是
如何用Python爬取网站数据：基础教程与实战大梦百万秋知识学爆 python 开发语言
数据爬取（WebScraping）是从网站中自动获取信息的过程。借助Python强大的库和工具，数据爬取变得非常简单且高效。本文将介绍Python爬取网站数据的基础知识、常用工具，以及一个简单的实战示例，帮助你快速上手网站数据爬取。1.什么是网站数据爬取？网站数据爬取是通过编写程序自动抓取网页内容的技术，通常用于从公开网站中提取特定数据。数据爬取的应用场景非常广泛，包括：收集商品价格和评论数据新闻
JavaWeb 开发入门：从基础到应用大梦百万秋知识学爆 java
JavaWeb是基于Java技术构建的Web应用开发体系。得益于Java的跨平台性和强大的生态系统，JavaWeb长期以来一直是企业级开发的首选方案之一。本篇博客将从JavaWeb的基本概念、核心技术到实际项目开发，带你全面了解如何利用JavaWeb构建一个动态网站。什么是JavaWeb？JavaWeb是使用Java技术开发Web应用程序的总称，通常包括动态网页、交互式功能和后端逻辑。它支持开发以
云原生前端开发：打造现代化高性能的用户体验大梦百万秋知识学爆状态模式
引言：前端开发的新风向在过去的几年中，前端开发领域经历了快速的演变，从早期的静态网页到如今复杂的单页应用（SPA），再到微前端架构和渐进式Web应用（PWA），前端技术一直处于技术变革的中心。而随着云原生的理念在后端开发中逐渐成熟，前端开发也迎来了新的机遇和挑战。云原生前端开发意味着应用的架构设计和开发方式需要更加注重现代化的开发工具链、灵活性、性能优化和可扩展性。本文将从技术角度讨论如何运用云原
分享一个 ASP.NET Web Api 上传和读取 Excel的方案代码掌控者 C#asp.net asp.net excel c#
前言许多业务场景下需要处理和分析大量的数据，而Excel是业务人员常用的数据表格工具，因此，将Excel表格中内容上传并读取到网站，是一个很常见的功能，目前有许多成熟的开源或者商业的第三方库，比如NPOI，EPPlus，Spire.Officefor.NET等等，今天分享一个使用Magicodes.IE.Excel上传和读取Excel的方案，这是近年来一个比较受欢迎的开源的第三方库，下面我们用一个
【漏洞预警】FortiOS 和 FortiProxy 身份认证绕过漏洞(CVE-2024-55591) 李火火安全阁漏洞预警 Fortinet
文章目录一、产品简介二、漏洞描述三、影响版本四、漏洞检测方法五、解决方案一、产品简介FortiOS是Fortinet公司核心的网络安全操作系统，广泛应用于FortiGate下一代防火墙，为用户提供防火墙、VPN、入侵防御、应用控制等多种安全功能。FortiProxy则是Fortinet提供的企业级安全代理产品，主要用于内容过滤、Web访问控制和数据安全防护等场景。下一代防火墙产品FortiGate
oracle创建用户，授权connect，resource后无法建表程序员WANG 数据库 oracle 数据库 dba
oracle创建用户后，授权很重要，grantconnect,resourcetodemo后，如果你觉得可以了，那就错了。具体授权分为三种方式：1、授权管理员权限，即grantconnect,resource,dbato用户；2、先划分角色，已分配权限的角色，授权给用户，grantrole1to用户，用户就拥有了该角色的权限;3、主要用的就是直接授权。详情如下：1.建用户createuserdem
如何使用 Python 和 Selenium WebDriver 获取 localStorage 潮易 python selenium 开发语言
如何使用Python和SeleniumWebDriver获取localStorage要使用Python和SeleniumWebDriver获取localStorage，您可以遵循以下步骤：###1.安装必要的库首先，您需要安装selenium库。可以通过pip进行安装：```bashpipinstallselenium```###2.下载WebDriver根据您的浏览器类型（如Chrome、Fir
使用迭代工具返回连续负数的最长列表。groupby 潮易 python
使用迭代工具返回连续负数的最长列表。groupby要使用Python编程解决这个问题，我们可以采用迭代和条件判断的方法。以下是一个简单的实现方法：```pythondeflongest_negatives(nums):max_length=0current_length=0start=-1foriinrange(len(nums)):ifnums[i]<0:ifcurrent_length==0:
Python 删除文件与文件夹 - 奇客谷教程八狐云|酷画册|二维码生成 python Python 教程 python
Python教程Python介绍Python开发环境搭建Python语法Python变量Python数值类型Python类型转换Python字符串(String)Python运算符Python列表(list)Python元组(Tuple)Python集合(Set)Python字典(Dictionary)PythonIf…ElsePythonWhile循环PythonFor循环Python函数Pyt
最新全开源IM即时通讯系统源码(PC+WEB+IOS+Android)部署指南 m0_74824823 开源前端 android
全开源IM（即时通讯）系统源码部署是一个复杂但系统的过程，涉及多个组件和步骤。以下是一个详细的部署指南，旨在帮助开发者或系统管理员成功部署一个全开源的IM系统，如OpenIM。IM即时通讯系统源码准备工作1.选择合适的IM系统源码及演示：ms.jstxym.top在部署之前，首先需要选择一个合适的全开源IM系统，在演示站找到合适的源码。OpenIM是一个广泛使用的开源IM解决方案，它提供了IM服务
Day38 WEB 攻防-通用漏洞&XSS 跨站&绕过修复&http_only&CSP&标签符号李八八8864 前端 xss http
#知识点1、XSS跨站-过滤绕过-便签&语句&符号等2、XSS跨站-修复方案-CSP&函数&http_only等演示案例ØXSS绕过-CTFSHOW-361到331关卡绕过WPØXSS修复-过滤函数&http_only&CSP&长度限制#XSS绕过-CTFSHOW-361到331关卡绕过WP绕过：XSS总结-先知社区(aliyun.com)316-反射型-直接远程调用window.location
ubuntu 安装 docker 2301_78094384 ubuntu docker linux
打开网站：Ubuntu|DockerDocs卸载冲突包forpkgindocker.iodocker-docdocker-composedocker-compose-v2podman-dockercontainerdrunc;dosudoapt-getremove$pkg;done运行以下命令#AddDocker'sofficialGPGkey:sudoapt-getupdatesudoapt-g
http和https分别是什么？区别是什么？神明木佑 http https 网络协议
HTTP和HTTPS是两种常见的网络协议，用于在Web上进行数据传输。以下是它们的简要解释和主要区别：HTTP（HypertextTransferProtocol）HTTP是一种应用层协议，用于在Web上传输数据。它是互联网上应用最为广泛的一种网络协议，所有的WWW文件都必须遵守这个标准。HTTP协议通常承载于TCP协议之上，有时也承载于TLS或SSL协议层之上，这个时候，就成了我们常说的HTTP
生命周期函数——created、onload、mounted、updated的执行顺序编程星空前端 javascript vue.js
created和onload是非常重要的生命周期函数，涉及到组件初始化和数据绑定的顺序。created：（1）created是在Vue实例创建完成后立即被执行的。（2）在created中我们可以访问到组件的数据和方法，并进行一些初始化操作。此时的this指向VueComponent（其中包含所有的组件数据和方法）（3）通常我们会在created函数中发送请求获取数据，并将其存储在组件的data中。
基于深度学习的推荐系统构建：Movielens 数据集 fresh的转码之路深度学习人工智能机器学习推荐算法
基于深度学习的推荐系统构建：Movielens数据集依赖环境代码语言：python3.11.5开发平台：pycharmtensorflow版本：2.18.0MovieLen1M数据及简介MovieLens1M数据集包含包含6000个用户在近4000部电影上的100万条评分，也包括电影元数据信息和用户属性信息。下载地址为：http://files.grouplens.org/datasets/mov
Easysearch Rollup 使用指南数据库搜索引擎
背景在现代数据驱动的世界中，时序数据的处理变得越来越重要。无论是监控系统、日志分析，还是物联网设备的数据收集，时序数据都占据了大量的存储空间。随着时间的推移，这些数据的存储成本和管理复杂度也在不断增加。为了解决这一问题，Rollup技术应运而生。本文将带你深入了解Rollup的概念、优势以及如何在Easysearch中使用Rollup来优化时序数据的存储和查询。什么是Rollup？Rollup是一
如何从0开始写一个操作系统 c后端
本贴用来记录作者用c语言写一个操作系统，主要参考《操作系统真相还原》一书写的，同时也会对书里的代码和linux进行对比，尽量看一下现代操作系统中是如何实现的。原书的代码https://github.com/yifengyou/os-elephant/tree/master我会挑一些说说传统的操作系统课一般从内存，虚拟化等等方面讲起，因为是自己实现操作系统，肯定不能一上来就写开始写内存管理这种大活，
LoadRunner如何监控Linux系统资源
使用LoadRunner监控Linux系统资源的步骤详解LoadRunner是一个广泛使用的性能测试工具，支持对Web应用程序、服务器和整个基础设施进行负载测试。在监控Linux系统资源方面，LoadRunner通过集成监控代理（Agent）来收集并显示有关CPU、内存、磁盘等系统资源的详细信息。以下是通过LoadRunner监控Linux系统资源的详细步骤：1.安装LoadRunnerAgent
jmeter录制过滤_Jmeter录制pc脚本 weixin_39757040 jmeter录制过滤
1.打开jmeter后可以看到左边窗口有个“测试计划”和“工作台”，右键“测试计划”，添加Threads(Users)→线程组，再右键线程组→添加配置元件→Http请求默认值Http请求默认值窗口下---在web服务器处的“服务器名称或IP”填上网址或IP(本地就填localhost的IP，端口填你部署的服务器端口，路径就填写域名后面的路径。2.可以有可以无。[作用：清楚所有录制的记录信息]3.右
用python操作浏览器的三种方式_经验 | python 操作浏览器的三种方式 weixin_39642619
第一种：selenium导入浏览器驱动，用get方法打开浏览器，例如：importtimefromseleniumimportwebdriverdefmac():#browser=webdriver.Chrome()#browser=webdriver.Firefox()browser=webdriver.Ie()browser.implicitly_wait(5)browser.get("htt
Flask 和阿里云 OSS 实现文件上传功能 ivwdcwso 开发 flask 阿里云 python oss
在本教程中,我们将学习如何使用Flask框架和阿里云对象存储服务(OSS)来创建一个简单而强大的文件上传应用。这个应用将允许用户通过Web界面上传文件,然后将文件安全地存储到阿里云OSS中,并返回可访问的文件URL。准备工作在开始之前,请确保您已经完成以下准备工作:安装Python(推荐Python3.7+)安装Flask:pipinstallflask安装阿里云OSSSDK:pipinstall
百度指数+selenium+request+比特指纹浏览器+pywebview+pandas+flask过程性万山y python selenium 爬虫 flask pandas
1.cookies和headrs问题使用selenium获得的cookies测试没有问题，但是获得的heards头不可以使用，经过测试比较需要添加或者修改几项重点的heards为{'Cipher-Text':'1704885072633_1704970047346_SlMkwPX0ZnotTaSrpOEx50xhLlPT5iMH867nxTtYuapcdPhsh2d2ooVE2F+RSm+yhIF
[ERROR] Malformed \uxxxx encoding.报错解决 Light__Chaser java
1、检查项目的.properties、.yml、pom.xml、logback等配置中，是否有路径错误使用2、更新maven仓库，重新下载jar包（没必要）可以将一些没下载成功的jar包重新下载，一般下载不成功的依赖，都会生成一个后缀未.lastupdated的文件，而且有这个文件一旦生成，那个依赖就会一直下载不成功，无论怎么reloadmaven仓库，都下载不成功。解决办法在文件资源管理器中找到
在CentOs上安装Docker，Docker中配置MYSQL，安装java Light__Chaser 微服务 java linux
在CentOs上安装Docker1.更新系统在安装Docker之前，建议先更新系统以确保所有软件包都是最新的。sudoyumupdate-y2.安装依赖包在CentOS上安装Docker需要一些额外的依赖工具。sudoyuminstall-yyum-utilsdevice-mapper-persistent-datalvm23.添加Docker仓库sudoyum-config-manager--a
【Python爬虫实战】轻量级爬虫利器：DrissionPage之SessionPage与WebPage模块详解易辰君 python爬虫 python 爬虫开发语言
个人主页：易辰君-CSDN博客系列专栏：https://blog.csdn.net/2401_86688088/category_12797772.html目录前言一、SessionPage（一）SessionPage模块的基本功能（二）基本使用（三）常用方法（四）页面元素定位和数据提取（五）Cookie和会话管理（六）SessionPage的优点和局限性（七）SessionPage和Driver
【Python爬虫实战】全面解析 DrissionPage：简化 Python 浏览器自动化的三种模式易辰君 python爬虫 python 爬虫开发语言
个人主页：易辰君-CSDN博客系列专栏：https://blog.csdn.net/2401_86688088/category_12797772.html目录前言一、DrissionPage简介（一）ChromiumPage（二）WebPage（三）SessionPage（四）三大模块总结二、ChromiumPage（一）初始化ChromiumPage（二）基本操作（三）等待元素加载（四）执行J
python中drop用法去重_如何使用drop_duplicates进行简单去重（入门篇） weixin_39991055 python中drop用法去重
什么是去重呢？简单来说，数据去重指的是删除重复数据。在一个数字文件集合中，找出重复的数据并将其删除，只保存唯一的数据单元。在我们的数据预处理过程中，这是一项我们经常需要进行的操作。去重有哪些好处？节省存储空间提升写入性能提高模型精度今天我们就来简单介绍一下，在pandas中如何使用drop_duplicates进行去重。一、函数体及主要参数函数体：df.drop_duplicates(subset
linux进程sl状态,linux进程状态s和sl的区别 weixin_39830688 linux进程sl状态
PROCESSSTATECODESHerearethedifferentvaluesthatthes,statandstateoutputspecifiers(header"STAT"or"S")willdisplaytodescribethestateofaprocess:Duninterruptiblesleep(usuallyIO)IIdlekernelthreadRrunningorrun
linux进程状态 DW,c - 诊断进程陷入D状态（不间断睡眠/阻塞IO） - SO中文参考 - www.soinside.com... 咔咔鲁斯 linux进程状态 DW
我们正在开发一个嵌入式Linux系统，使用Live555WIS-Streamer通过网络在RTSP上传输视频。在一个特定的系统中，我们看到WIS-Streamer卡在TASK_UNINTERRUPTIBLE状态;从命令行：进程的ps状态显示为DW，WIS进程的子进程都列为Zombie状态。一旦我们处于这种状态，看起来我们无能为力，除了重启(不可取)。然而，我们真的很想找到这个的根本原因-我怀疑在流
web报表工具FineReport常见的数据集报错错误代码和解释老A不折腾 web报表 finereport 代码可视化工具
在使用finereport制作报表，若预览发生错误，很多朋友便手忙脚乱不知所措了，其实没什么，只要看懂报错代码和含义，可以很快的排除错误，这里我就分享一下finereport的数据集报错错误代码和解释，如果有说的不准确的地方，也请各位小伙伴纠正一下。 NS-war-remote=错误代码\:1117 压缩部署不支持远程设计 NS_LayerReport_MultiDs=错误代码
Java的WeakReference与WeakHashMap bylijinnan java 弱引用
首先看看 WeakReference wiki 上 Weak reference 的一个例子： public class ReferenceTest { public static void main(String[] args) throws InterruptedException { WeakReference r = new Wea
Linux——（hostname）主机名与ip的映射 eksliang linux hostname
一、什么是主机名无论在局域网还是INTERNET上，每台主机都有一个IP地址，是为了区分此台主机和彼台主机，也就是说IP地址就是主机的门牌号。但IP地址不方便记忆，所以又有了域名。域名只是在公网（INtERNET)中存在，每个域名都对应一个IP地址，但一个IP地址可有对应多个域名。域名类型 linuxsir.org 这样的；主机名是用于什么的呢？答：在一个局域网中，每台机器都有一个主
oracle 常用技巧 18289753290
oracle常用技巧 ①复制表结构和数据 create table temp_clientloginUser as select distinct userid from tbusrtloginlog ②仅复制数据如果表结构一样 insert into mytable select * &nb
使用c3p0数据库连接池时出现com.mchange.v2.resourcepool.TimeoutException 酷的飞上天空 exception
有一个线上环境使用的是c3p0数据库，为外部提供接口服务。最近访问压力增大后台tomcat的日志里面频繁出现 com.mchange.v2.resourcepool.TimeoutException: A client timed out while waiting to acquire a resource from com.mchange.v2.resourcepool.BasicResou
IT系统分析师如何学习大数据蓝儿唯美大数据
我是一名从事大数据项目的IT系统分析师。在深入这个项目前需要了解些什么呢？学习大数据的最佳方法就是先从了解信息系统是如何工作着手，尤其是数据库和基础设施。同样在开始前还需要了解大数据工具，如Cloudera、Hadoop、Spark、Hive、Pig、Flume、Sqoop与Mesos。系统分析师需要明白如何组织、管理和保护数据。在市面上有几十款数据管理产品可以用于管理数据。你的大数据数据库可能
spring学习——简介 a-john spring
Spring是一个开源框架，是为了解决企业应用开发的复杂性而创建的。Spring使用基本的JavaBean来完成以前只能由EJB完成的事情。然而Spring的用途不仅限于服务器端的开发，从简单性，可测试性和松耦合的角度而言，任何Java应用都可以从Spring中受益。其主要特征是依赖注入、AOP、持久化、事务、SpringMVC以及Acegi Security 为了降低Java开发的复杂性，
自定义颜色的xml文件 aijuans xml
<?xml version="1.0" encoding="utf-8"?> <resources> <color name="white">#FFFFFF</color> <color name="black">#000000</color> &
运营到底是做什么的？ aoyouzi 运营到底是做什么的？
文章来源：夏叔叔（微信号：woshixiashushu），欢迎大家关注！很久没有动笔写点东西，近些日子，由于爱狗团产品上线，不断面试，经常会被问道一个问题。问：爱狗团的运营主要做什么？答：带着用户一起嗨。为什么是带着用户玩起来呢？究竟什么是运营？运营到底是做什么的？那么，我们先来回答一个更简单的问题——互联网公司对运营考核什么？以爱狗团为例，绝大部分的移动互联网公司，对运营部门的考核分为三块——用
js面向对象类和对象百合不是茶 js 面向对象函数创建类和对象
接触js已经有几个月了,但是对js的面向对象的一些概念根本就是模糊的,js是一种面向对象的语言但又不像java一样有class,js不是严格的面向对象语言 ,js在java web开发的地位和java不相上下 ,其中web的数据的反馈现在主流的使用json,json的语法和js的类和属性的创建相似下面介绍一些js的类和对象的创建的技术一:类和对
web.xml之资源管理对象配置 resource-env-ref bijian1013 java web.xml servlet
resource-env-ref元素来指定对管理对象的servlet引用的声明，该对象与servlet环境中的资源相关联 <resource-env-ref> <resource-env-ref-name>资源名</resource-env-ref-name> <resource-env-ref-type>查找资源时返回的资源类
Create a composite component with a custom namespace sunjing
https://weblogs.java.net/blog/mriem/archive/2013/11/22/jsf-tip-45-create-composite-component-custom-namespace When you developed a composite component the namespace you would be seeing would
【MongoDB学习笔记十二】Mongo副本集服务器角色之Arbiter bit1129 mongodb
一、复本集为什么要加入Arbiter这个角色回答这个问题，要从复本集的存活条件和Aribter服务器的特性两方面来说。什么是Artiber？ An arbiter does not have a copy of data set and cannot become a primary. Replica sets may have arbiters to add a
Javascript开发笔记白糖_ JavaScript
获取iframe内的元素通常我们使用window.frames["frameId"].document.getElementById("divId").innerHTML这样的形式来获取iframe内的元素，这种写法在IE、safari、chrome下都是通过的，唯独在fireforx下不通过。其实jquery的contents方法提供了对if
Web浏览器Chrome打开一段时间后，运行alert无效 bozch Web chorme alert 无效
今天在开发的时候，突然间发现alert在chrome浏览器就没法弹出了，很是怪异。试了试其他浏览器，发现都是没有问题的。开始想以为是chorme浏览器有啥机制导致的，就开始尝试各种代码让alert出来。尝试结果是仍然没有显示出来。这样开发的结果，如果客户在使用的时候没有提示，那会带来致命的体验。哎，没啥办法了就关闭浏览器重启。结果就好了，这也太怪异了。难道是cho
编程之美-高效地安排会议图着色问题贪心算法 bylijinnan 编程之美
import java.util.ArrayList; import java.util.Collections; import java.util.List; import java.util.Random; public class GraphColoringProblem { /**编程之美高效地安排会议图着色问题贪心算法 * 假设要用很多个教室对一组
机器学习相关概念和开发工具 chenbowen00 算法 matlab 机器学习
基本概念：机器学习(Machine Learning, ML)是一门多领域交叉学科，涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为，以获取新的知识或技能，重新组织已有的知识结构使之不断改善自身的性能。它是人工智能的核心，是使计算机具有智能的根本途径，其应用遍及人工智能的各个领域，它主要使用归纳、综合而不是演绎。开发工具 M
[宇宙经济学]关于在太空建立永久定居点的可能性 comsci 经济
大家都知道,地球上的房地产都比较昂贵,而且土地证经常会因为新的政府的意志而变幻文本格式........ 所以,在地球议会尚不具有在太空行使法律和权力的力量之前,我们外太阳系统的友好联盟可以考虑在地月系的某些引力平衡点上面,修建规模较大的定居点
oracle 11g database control 证书错误 daizj oracle 证书错误 oracle 11G 安装
oracle 11g database control 证书错误 win7 安装完oracle11后打开 Database control 后，会打开em管理页面，提示证书错误，点“继续浏览此网站”，还是会继续停留在证书错误页面解决办法：是 KB2661254 这个更新补丁引起的，它限制了 RSA 密钥位长度少于 1024 位的证书的使用。具体可以看微软官方公告：
Java I/O之用FilenameFilter实现根据文件扩展名删除文件游其是你 FilenameFilter
在Java中，你可以通过实现FilenameFilter类并重写accept(File dir, String name) 方法实现文件过滤功能。在这个例子中，我们向你展示在“c:\\folder”路径下列出所有“.txt”格式的文件并删除。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
C语言数组的简单以及一维数组的简单排序算法示例，二维数组简单示例 dcj3sjt126com c array
# include <stdio.h> int main(void) { int a[5] = {1, 2, 3, 4, 5}; //a 是数组的名字 5是表示数组元素的个数，并且这五个元素分别用a[0], a[1]...a[4] int i; for (i=0; i<5; ++i) printf("%d\n",
PRIMARY, INDEX, UNIQUE 这3种是一类 PRIMARY 主键。就是唯一且不能为空。 INDEX 索引，普通的 UNIQUE 唯一索引 dcj3sjt126com primary
PRIMARY, INDEX, UNIQUE 这3种是一类PRIMARY 主键。就是唯一且不能为空。INDEX 索引，普通的UNIQUE 唯一索引。不允许有重复。FULLTEXT 是全文索引，用于在一篇文章中，检索文本信息的。举个例子来说，比如你在为某商场做一个会员卡的系统。这个系统有一个会员表有下列字段：会员编号 INT会员姓名
java集合辅助类 Collections、Arrays shuizhaosi888 Collections Arrays HashCode
Arrays、Collections 1 ）数组集合之间转换 public static <T> List<T> asList(T... a) { return new ArrayList<>(a); } a）Arrays.asL
Spring Security（10）——退出登录logout 234390216 logout Spring Security 退出登录 logout-url LogoutFilter
要实现退出登录的功能我们需要在http元素下定义logout元素，这样Spring Security将自动为我们添加用于处理退出登录的过滤器LogoutFilter到FilterChain。当我们指定了http元素的auto-config属性为true时logout定义是会自动配置的，此时我们默认退出登录的URL为“/j_spring_secu
透过源码学前端之 Backbone 三 Model 逐行分析JS源代码 backbone 源码分析 js学习
Backbone 分析第三部分 Model 概述： Model 提供了数据存储，将数据以JSON的形式保存在 Model的 attributes里，但重点功能在于其提供了一套功能强大，使用简单的存、取、删、改数据方法，并在不同的操作里加了相应的监听事件，如每次修改添加里都会触发 change，这在据模型变动来修改视图时很常用，并且与collection建立了关联。
SpringMVC源码总结（七）mvc:annotation-driven中的HttpMessageConverter 乒乓狂魔 springMVC
这一篇文章主要介绍下HttpMessageConverter整个注册过程包含自定义的HttpMessageConverter，然后对一些HttpMessageConverter进行具体介绍。 HttpMessageConverter接口介绍： public interface HttpMessageConverter<T> { /** * Indicate
分布式基础知识和算法理论 bluky999 算法 zookeeper 分布式一致性哈希 paxos
分布式基础知识和算法理论 BY [email protected] 本文永久链接：http://nodex.iteye.com/blog/2103218 在大数据的背景下，不管是做存储，做搜索，做数据分析，或者做产品或服务本身，面向互联网和移动互联网用户，已经不可避免地要面对分布式环境。笔者在此收录一些分布式相关的基础知识和算法理论介绍，在完善自我知识体系的同
Android Studio的.gitignore以及gitignore无效的解决 bell0901 android gitignore
　　github上.gitignore模板合集，里面有各种.gitignore ： https://github.com/github/gitignore 　　自己用的Android Studio下项目的.gitignore文件，对github上的android.gitignore添加了　　　　　　# OSX files　　　　　　//mac os下　　　　　　.DS_Store
成为高级程序员的10个步骤 tomcat_oracle 编程
What 软件工程师的职业生涯要历经以下几个阶段：初级、中级，最后才是高级。这篇文章主要是讲如何通过 10 个步骤助你成为一名高级软件工程师。 Why 得到更多的报酬！因为你的薪水会随着你水平的提高而增加提升你的职业生涯。成为了高级软件工程师之后，就可以朝着架构师、团队负责人、CTO 等职位前进历经更大的挑战。随着你的成长，各种影响力也会提高。
mongdb在linux下的安装 xtuhcy mongodb linux
一、查询linux版本号： lsb_release -a LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noa