A System for Collecting and Analyzing Topic-Specific Web Information

WTMS: A System for Collecting and Analyzing Topic-Specific Web Information

Sougata Mukherjea,
C&C Research Laboratories,
NEC USA Inc.,
San Jose, Ca, USA

Abstract:

With the explosive growth of the World-Wide Web, it is becoming increasingly difficult for users to collect and analyze Web pages that are relevant to a particular topic. To address this problem we are developing WTMS, a system for Web Topic Management. In this paper we explain how the WTMS crawler efficiently collects Web pages for a topic. We also introduce the user interface of the system that integrates several techniques for analyzing the collection. Moreover, we present the various views of the interface that allow navigation through the information space. We highlight several examples to show how the system enables the user to gain useful insights about the collection.

Keywords: World-Wide Web, Topic Management, Focussed Crawling, Information Visualization, Graph Algorithms, Hubs, Authorities

1. Introduction

The World-Wide Web is undoubtedly the best source for getting information on any topic. Therefore, more and more people use the Web for topic management [1], the task of gathering, evaluating and organizing information resources on the Web. Users may investigate topics both for professional or personal interests.

Generally the popular portals or search engines like Yahoo and Alta Vista are used for gathering information on the WWW. However, with the explosive growth of the Web, topic management is becoming an increasingly difficult task. On the one hand this leads to a large number of documents being retrieved for most queries. The results are presented as pages of scrolled lists. Going through these pages to retrieve the relevant information is tedious. Moreover, the Web has over 350 million pages and continues to grow rapidly at a million pages per day [4]. Such growth and flux pose basic limits of scale for today's generic search engines. Thus, many relevant information may not have been gathered and some information may not be up-to-date.

Because of these problems, recently there is much awareness that for serious Web users, focussed portholes are more useful than generic portals [9]. Therefore, systems that allow the user to collect and organize the information related to a particular topic and allow easy navigation through this information space is becoming essential. Such a Web Topic Management System should have several features to be really useful:

Focussed Crawler:
A crawler that allows the collection of resources from the WWW that relates to a user-specified topic is an essential requirement of the system. An effective crawler should have high precision (all the retrieved Web pages should belong to the topic) and recall (most of the relevant information available on the topic should be gathered). Moreover the crawler should be efficient; the relevant information should be collected in the least amount of time possible.
Viewing information at various levels of abstraction:
A search engine generally shows the relevant Web pages as the result of a user's query. However, sometimes authors organize the information as a collection of pages; in these cases presenting the collection may be more useful for the reader. In fact, if a Web site has many relevant pages, presenting the Web site itself may be better. Furthermore, if there are many Web sites for a particular topic, grouping similar Web sites may be more convenient. Therefore, an effective Topic Management system should be able to present the information at various levels of abstraction depending on the user's focus.
Integrate querying and browsing:
The two major ways to access information in the Web are querying and browsing. Querying is appropriate when the user has a well-defined understanding of what information is needed and how to formulate a query. However, in many cases the user is not certain of exactly what information is desired and needs to learn more about the content of the information space. In these cases browsing is an ideal navigational strategy. Browsing can also be combined with querying when the result of the query is too large for the user to comprehend (by letting the user browse through the results) or too small (by showing the user other related information). An effective Topic Management system should allow the user to smoothly integrate querying and browsing to retrieve the relevant information.
Beyond keyword-based querying:
For advanced analysis of a topic, the keyword-based queries that are provided by the search engines may not be adequate. More sophisticated querying techniques based on the contents as well as the structure of the information space are essential. For example, the user should be able to group together related information and filter out unimportant information.
Allow easy sharing of the gathered information:
Once the collection for a particular topic has been gathered and organized, different users with interest in the topic should be able to share the information.

We are building WTMS, a Web Topic Management System to allow the collection and analysis of information on the Web related to a particular topic. This paper discusses the various features of the system. The next section cites related work. Section 3 explains how the focussed crawler of WTMS allows the collection of information from the WWW relevant to a topic. Section 4 presents the various views of the system that allow the user to navigate through the information space. Several graph algorithm based techniques to analyze the collection are introduced in section 5. Finally, section 6 is the conclusion.

2. Related Work

2.1. Visualizing the World-Wide Web

Several systems for visualizing WWW sites have been developed. Examples includes Navigational View Builder [16], Harmony Internet Browser [2] and Narcissus [11]. Visualization techniques for World-Wide Web search results are also beingdeveloped. For example, the WebQuery system [7] visualizes the results of a search query along with all pages that link to or are linked to by any page in the original result set. Another example is WebBook [6], which potentially allows the results of the search to be organized and manipulated in various ways in a 3D space. In this paper our emphasis is to develop views that allow navigation through Web pages about a particular topic and gain useful insights about the collection.

2.2. World-Wide Web Topic Management

In recent times there has been much interest in collecting Web pages related to a particular topic. The shark-search algorithm for collecting topic-specific pages is presented in [12]. Another focussed crawler is presented in [9]. The crawler used in WTMS is similar to these systems. However, it uses several heuristics to improve performance.

Another interesting problem is determining the important pages in a collection. Kleinberg defines the HITS algorithm to identify authority and hub pages in a collection [13]. Authority pages are authorities on a topic and hubs point to many pages relevant to the topic. Therefore, pages with many in-links, specially from hubs, are considered to be good authorities. On the other hand, pages with many out-links, specially from authorities, are considered to be good hubs. The algorithm has been refined in CLEVER [8] and Topic Distillation [5]. We feel that this algorithm is very important for a Web Topic Management system. However, we also believe that determining the hub and authority sites for a topic are more useful than determining the hub and authority pages.

Mapuccino (formerly WebCutter) [15], [12], [3] and TopicShop [20], [1] are two systems that have been developed for WWW Topic Management. Both systems use a crawler for collecting Web pages related to a topic and use various types of visualization to allow the user to navigate through the resultant information space. While Mapuccino presents the information as a collection of Web pages, TopicShop presents the information as a collection of Web sites. As emphasized in the introduction, we believe that it is more effective to present the information at various levels of abstraction depending on the user's focus. Moreover, a topic management system should allow the user to use several techniques to analyze the information space.

3. Crawling

The architecture of the Web Topic Management System is discussed in [18]. The system uses a focussed crawler to collect Web pages related to a user-specified topic. In this section we will discuss the basic strategy for focussed crawling and how we can improve performance by using some heuristics.

3.1. Basic Focussed Crawling Technique

For collecting WWW pages related to a particular topic the user has to first specify some seed urls relevant to the topic. Alternatively, the user can specify the keywords for the topic and the crawler issues a query with the specified keywords to a popular Web search engine and uses the results as the seed urls. The WTMS crawler downloads the seed urls and creates a representative document vector (RDV) based on the frequently occurring keywords in these urls.

The crawler then downloads the pages that are referenced from the seed urls and calculates their similarity to the RDV using the vector space model [19]. If the similarity is above a threshold, the pages are indexed by a text search engine and the links from the pages are added to a queue. The crawler continues to follow the out-links until the queue is empty or an user-specified limit is reached.

The crawler also determines the pages pointing to the seed urls. (A query link:u to search engines like AltaVista and Google returns all pages pointing to url u). These pages are downloaded and if their similarity to the RDV is greater than a threshold, they are indexed and the urls pointing to these pages are added to the queue. The crawler continues to follow the in-links until the queue is empty or an user-specified limit is reached.

After crawling, the collection consists of the seed urls as well as all pages similar to these seed urls that have paths to or from the seeds. We believe that this collection is a good source of information available on the WWW for the user-specified topic. It should be noted that the crawler has a stop url list to avoid downloading some popular pages (like Yahoo's and Netscape's Home pages) as part of the collection.

This focussed crawling strategy is similar to the techniques described in [12] and [9]. Most focussed crawlers start with some seed urls and follows the out-links (sometimes in-links also). Pages that are relevant to the topic of interest are indexed and the links from these pages are also followed. The relevancy can be determined by various techniques like the vector space model (as in [12]) or using a classifier (for example in [9]).

3.2. Using Heuristics to Improve Performance

The main bottleneck of a crawler is the time spent in downloading Web pages. Besides network congestion, a crawler needs to follow the convention of issuing one download request to a site per 30 seconds. Therefore, downloading many pages from a single site may take a long time. For a focussed crawler only Web pages relevant to the topic are important. So many of the downloaded pages may have to be discarded. In fact, using our basic focussed crawler for topics as diverse as World Cup Soccer, Information Visualization and Titanic we found that less than 50% of the downloaded pages were found to be relevant. If we could determine that a page will be irrelevant without examining the contents of the page, we can avoid downloading the page, thereby improving performance. We use two heuristics for the purpose.

3.2.1. Nearness of the current page to the linked page

If a web site has information related to several topics, a page in the Web site for one of the topics may have links to pages relating to the other topics or to the main page of the site for ease of navigation. For example, the page http://www.discovery.com/area/science/titanic/weblinks.html, a page relevant to Titanic has a link to http://www.discovery.com/online.html, the main page of Discovery online, a page not related to the topic. However, since most web sites are well organized, pages that are dissimilar do not occur near each other in the directory hierarchy.

Therefore, when we are examining the pages linked to or linked from a Web page, we need to download the linked page only if it is near the current page. The determination of nearness will be an optimization between the number of irrelevant downloads and the number of relevant pages that are not downloaded. If we use a strict criteria to determine the nearness between two pages, the number of downloads will be lower, but we may miss some relevant pages. On the other hand, a lenient criteria to determine nearness will retrieve all the relevant pages but at the cost of increasing the number of downloads.

A System for Collecting and Analyzing Topic-Specific Web Information

Figure 1: Determining the nearness of two pages in the directory hierarchy

Figure 1 shows how nearness is determined by the WTMS crawler. Suppose a page in the directory A/B is the current page. Then pages in the same web site are considered to be near (and therefore downloaded) if and only if they belong to the directories shown in the figure. Thus pages in the parent directory (A) as well as any children directories (C,D) are considered to be near. Sections of sibling directories (E,F,G) are also downloaded. After crawling several web sites, we found that this definition of nearness gives the optimal result. It should be noted that if a page has a link to or from a page in another Web site, we have to download the page (unless it is in the stop url list). Also note that if a url contains any of the topic keywords, the page will be always downloaded. So all pages from http://www.titanicmovie.com will be downloaded for the Titanic collection.

3.2.2. Irrelevant Directories

Because web sites are well organized, generally most pages in the same directory have similar themes. Thus all pages in http://www.murthy.com/txlaw/ talk about tax laws. One of these pages, http://www.murthy.com/txlaw/txwomsoc.html was retrieved by a query to Google with the keywords "world cup soccer" since it talks about visa issuance to the Women's World Cup Soccer tournament. However, none of the other pages of the directory are relevant to the collection on World Cup Soccer. However, in the basic crawler all these pages were downloaded, only to be discarded after determining the similarity to RDV.

To avoid this problem, during crawling, we keep a count on the number of pages that have been indexed and ignored for each directory. If more than 25 pages of a directory are downloaded and 90% of those pages are ignored, we do not download any more pages from that directory.

3.3. Evaluation

Collections	Information Visualization	World Cup Soccer	Titanic
% Download	73.54	66.27	77.4
% Nearness	21.9	27.59	15.7
% Rejected Directories	4.56	6.14	6.9
% Relevant Pages Missed	4.26	4.04	2.56
Average Score Pages Missed	0.261	0.279	0.254

Table 1: Effectiveness of the heuristics in improving crawling performance

Table 1 shows the comparison between the basic crawler and the enhanced crawlers for 3 collections, Information Visualization, World Cup Soccer and Titanic. The following statistics are shown:

% Download = 100*(Number of pages downloaded by enhanced crawler)/(Number of pages downloaded by basic crawler)
For all the 3 collections there is a significant decrease in the number of urls that were downloaded. Besides reducing the network overhead, the amount of time needed to wait between successive requests to the same server is also decreased.
% Nearness is the number of pages not downloaded because the linked page was not near the original page.
A significant number of pages were not downloaded using this heuristic.
% Rejected Directories is the number of pages not downloaded because a significant number of pages from the directory the page belonged to, were already rejected after downloading.
Unfortunately, only a small number of pages were ignored using this criteria. Maybe extending the criteria to determine irrelevant sites instead of irrelevant directories may be more useful.
% Relevant Pages Missed = 100 * (Number of relevant pages indexed by the basic crawler but ignored by the enhanced crawler)/ (Total number of pages indexed by the basic crawler)
This statistics shows the number of relevant pages that were ignored by the enhanced crawler. It indicates that most of the pages that were not downloaded by the enhanced crawler were not relevant to the topic and were not indexed by the basic crawler also.
Average Score Pages Missed is the average score of the relevant pages indexed by the basic crawler but ignored by the enhanced crawler.
Even if the enhanced crawler misses some relevant pages, for the crawler to be acceptable, none of these pages should be very important to the collection. To measure the importance of the missed pages, we calculate the average similarity scores of these pages. In our crawler all pages whose similarity to the representative document vector is greater than 0.25 are indexed (we use a vector space model where the similarity of a page is a number between 0 and 1). Since the average score of the missed pages is close to 0.25, it shows that these pages were not the most important pages of the collection.

Thus, our enhancements were able to significantly reduce the download time without missing many relevant pages of the collection.

4. Interfaces for Topic Management

After the information about a particular topic has been crawled and indexed, a WTMS server is initialized. Any user can access the WTMS server from a Web browser. The WTMS user interface is opened in the client browser as a Java (Swing) applet. Since most users won't require all the collected information, the applet initially loads only a minimal amount of information. Based on user interactions, the applet can request further information from the WTMS server. XML is used for information exchange between the WTMS server and clients.

The WTMS interface provides various types of views to allow navigation through the gathered information space. This section gives a brief description of the WTMS interface applet.

4.1. Tabular Overview

In WTMS the collected information is organized into an abstraction hierarchy as discussed in [18]. The Web pages relevant to a topic downloaded by the crawler are considered to be the smallest unit of information in the system. The pages can be grouped into their respective physical domains; for example all pages in www.cnn.com can be grouped together. Many large corporations use several Web servers based on functionality or geographic location. For example, www.nec.com and www.nec.co.jp are the Web sites of NEC Corporation in US and Japan respectively. Similarly, shopping.yahoo.com is the shopping component of the Yahoo www.yahoo.com portal. Therefore, we group together related physical domains into logical Web sites by analyzing the urls. (In other cases we may need to break down the domains into smaller logical Web sites. For example, for corporations like Geocities and Tripod who provide free home pages, the Web sites www.geocities.com and members.tripod.com are actually a collection of home pages with minimal relationship among each other. However, this technique has not yet been incorporated into WTMS.)

Most HTML authors organize the information into a series of HTML pages for ease of navigation. Moreover, readers generally want to know what is available on a given site, not on a single page. Therefore, logical Web sites are the basic unit of information in WTMS. Thus, for calculating the hub and authority scores, we apply the algorithm introduced in [13] to logical sites instead of individual Web pages.

Figure 2: A table showing the logical sites relevant to a collection on World Cup Soccer sorted by the authority scores

Figure 2 shows the initial view of the WTMS interface for a collection on World Cup Soccer. It shows a table containing various information about the logical Web sites that were discovered for the topic. Besides the url and the number of pages, it shows the hub and authority scores. (These scores have values between 0 and 1). The table also shows the title of the main page of the sites. The main page is determined by the connectivity and the depth of the page in the site hierarchy as discussed in [17]. Note that generally while calculating the hub and authority scores, intra-site links are ignored [5]. Assuming that all Web pages within a site are by the same author, this removes the author's judgment while determining the global importance of a page within the overall collection. However, while determining the local importance of a page within a site, the author's judgment is also important. So the intra-site links are not ignored while determining the main page of a site.

The table gives a good overview about the collection. Clicking on the labels on the top the user can sort the table by that particular statistics. For example, in Figure 2, it is sorted by the authority scores. Some authorities for information on World Cup Soccer are shown at the top.

4.2. Visualizing Information at Various Levels of Abstraction

A logical site has several physical domains. The domains consists of directories and each directory has several Web pages and sub-directories. WTMS allows the users to view the information about a collection at various levels of abstraction depending on their focus. We discuss some of these visualizations in this subsection. Note that WTMS also allows the user to group related sites as discussed in section 5.

4.2.1. Site Level Views

	(a) The logical web site is selected
	(b) A directory within the site is selected

Figure 3: Visualizing the details of a logical site

Figure 3 shows the details of the site seawifs.gsfc.nasa.gov, the highest authority for the collection on Titanic. In Figure 3(a) the logical site itself is selected. The constituents of the site as well as sites having links to and from the selected site are shown as a glyphs (graphical elements). The figure shows that the logical site has several physical domains like rsd.gsfc.nasa.gov and seawifs.gsfc.nasa.gov. Notice that if a physical domain has just one directory or site (for example daac.gsfc.nasa.gov) then the glyph for that directory or page is shown. The brightness of a glyph is proportional to the last modification time of the freshest page in the group. Thus, bright red glyphs indicate groups which contain very fresh pages (for example, www.dimensional.com) and black glyphs indicates old pages (for example, www.nationalgeographic.com).

The view also shows the links to and from the currently selected site. If another site has a link to the site, it has an arrow pointing out. Similarly, if a site has a link from the selected site, it has an arrow pointing in. Thus, Figure 3(a) shows that www.marinemeuseum.org has both links to and from seawifs.gsfc.nasa.gov. The arrows to and from the children of the selected site, give an indication of their connectivity. For example, the figure shows that the domain rsd.gsfc.nasa.gov has only out-links while seawifs.gsfc.nasa.gov has both in-links and out-links. The thickness of the arrow is proportional to the number of links. On the other hand, the color of the arrows indicates whether the link is inter-site or intra-site. Green indicates inter-site links. Thus all links from pages in the domain rsd.gsfc.nasa.gov are to other pages within the same logical site. On the other hand, blue indicates inter-site links. Thus all links from the other sites are blue. A combination of inter-site and intra-site links is indicated by cyan. Thus seawifs.gsfc.nasa.gov has both inter-site and intra-site in-links and out-links.

The user can click on a glyph with the right mouse button to see more details. For example, in Figure 3(b) the user has navigated further down the hierarchy and selected the directory seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML. The relevant web pages in the directory are shown. The directory has a page with a large number of in-links and out-links, titanic.html. Note that the glyph is a circle for a Web page and rectangle for the groups. Further, the label of the currently selected node is highlighted and the rectangles representing the currently selected path is not filled.

4.2.2. Page Level Views

Figure 4: Information about the Web page www.cc.gatech.edu/gvu

Clicking on a glyph with the left mouse button shows information about the corresponding page or site. Thus, Figure 4 shows information about the Georgia Tech Graphics, Visualization & Usability Center Home Page, a page in the Information Visualization collection. This dialog box allows the user to visit the page, see the links to and from the page or see the related pages. For example, Figure 5 shows the url, title as well as the structural and semantic similarity values of pages related to the above page. The table is sorted by the semantic similarity of a page to the selected page which is determined by the vector space model [19]. The structural similarity of a page is determined by the links of the page with respect to the selected page and is calculated as follows:

For each direct link to and from any page B to A, the structural similarity score of B is increased by 1.0.
For each indirect link between pages B and A, the structural similarity score is increased by 0.5. There can be 3 types of indirect links:
1. Transitive: B links to C and C links to A or vice versa.
2. Social Filtering: Both A and B point to C.
3. Co-citation: C points to both A and B.
The scores are normalized to determine the structural similarity.

Figure 5: Related web pages for www.cc.gatech.edu/gvu

5. Analyzing the Site Graph

The WTMS interface allows the user to filter the information space based on various criteria. For example, the user can see only pages modified before or after a particular date. Moreover, in the Related Pages table (Figure 5), the user can see only pages that are co-cited or directly linked to the selected page. Another useful technique is to specify keywords to determine relevant Web pages or sites. For keyword queries, as well as for determining the similar pages to a selected pages, the WTMS interface sends a query to the WTMS server. The results from the server are used in the visualizations.

To make a Web Topic Management system more useful, it should provide analysis beyond traditional keyword or attribute based querying. At present WTMS provides various graph algorithm based techniques to analyze the topic-specific collection. The algorithms are applied on the site graph which is a directed graph with the logical sites as the nodes. If a page in site a has a link to a page in site b, then an edge is created from node a to b in the site graph. It should be emphasized that our analysis techniques are not very complex and thus applicable to collections with a large number of Web sites in real time.

5.1. Connected Components

One useful technique is to consider the site graph as an undirected graph and determine the connected components [10]. Nodes in the same connected component are connected by paths of one or more links and can thus be considered to be related.


(a) World Cup Soccer	(b) Titanic

Figure 6: The Connected Components of collections

Figure 6 shows the connected components that were discovered for the collections on World Cup Soccer and Titanic. If some connected component contained only one site, the site itself is shown (for example, www.iranian.com in Figure 6(a)). Each component is represented by a glyph whose size is proportional to the number of Web pages the group contains. The glyphs are ordered by the maximum authority score among the sites in the group. Thus the group containing seawifs.gsfc.nasa.gov is at the top for the Titanic collection. The label of a connected component is determined by the site with the highest authority score it contains. The user can click on a glyph to see the logical sites and web pages of the corresponding component. For example, in Figure 6(a), the user is seeing the details of a connected component containing the site www.ee.umd.edu.

Figure 6 shows that for both the collections only a few connected components were formed (even though the collections had thousands of pages). Thus the connected components is a useful mechanism to group the logical sites into meaningful clusters. The major information about the topic can be found in the top few components. The isolated sites found later in the view generally contain information different from the main theme of the collection. For example, as seen from Figure 6(a), we see that the Electrical Department of the University of Maryland is in the collection on World Cup Soccer. On examining the main page of the site (for this collection) www.ee.umd.edu/~dstewart/pinball/PAPA6/, we found that the page talks about the 1998 World Pinball Championship. The search engine retrieved the page, since one of the participating teams was named World Cup Soccer! Similarly, as discussed in section 3.2.2, the site www.murthy.com which talks about tax laws, is in the collection since one of its page talks about visa issuance to the Women's World Cup Soccer tournament. Further, for the Titanic collection shown in Figure 6(b), we found an isolated site for a casino named Titanic. Thus, it is evident that the connected components are effective in isolating pages that are in the fringes of the collection.

5.2. Strongly Connected Components

Figure 7: A strongly connected component in the Information Visualization collection

Considering the site graph as a directed graph we can also determine the strongly connected components [10]. Each pair of nodes of a strongly connected component have bi-directional links between them. Figure 7 shows a strongly connected components for the Information Visualization collection. All the sites shown in the figure are reachable from each other. For a strongly connected component with a large number of nodes and links, the graph is too complex for the user to understand. Therefore WTMS allows various ways to filter the graph:

Figure 8: Showing the paths from www-graphics.stanford.edu in the strongly connected component

The user can see the paths to or from a selected node to all other nodes in the graph. For example, Figure 8 shows the paths from the site www-graphics.stanford.edu, one of the best hubs on the topic.
We also allow the integration of keyword queries with structure-based analysis. For example, Figure 9 shows the same strongly connected component after a query with the keywords "stock market" . Sites relevant to visualization of stock markets are only shown. Notice that the site computer.org is isolated since the sites connecting it to the other sites were filtered out by the query.

Figure 9: The strongly connected component after a query with "stock market"

	(a) For the Information Visualization collection, 20%of the sites reference each other
	(b) For the World Cup Soccer collection, 3.5% of the sites reference each other

Figure 10: Information about the strongly collected components for a collection

The dialog box shown in Figure 10 gives an indication about the strongly connected components of a collection. The figure shows that for an academic topic like Information Visualization more sites are grouped into strongly connected components (20%) than for a topic of more general interest like World Cup Soccer (only 3.5%). On analyzing the collections using the WTMS interface, we determined that this is because the Information Visualization collection mainly consists of Web sites of research communities which sometimes cross-reference each other. On the other hand, the World Cup Soccer collection consists of several commercial Web sites on the topic as well as personal pages of people interested in the topic. The personal pages may point to some commercial sites but not vice versa and of course there is hardly any cyclical referencing among competing commercial sites. Thus, the number of sites that belong to a strongly connected component gives an indication about whether the collection is competitive or collaborative. In a collaborative collection, unlike a competitive collection, there will be several Web sites referencing each other; therefore these sites will be grouped into strongly connected components.

5.3. Optimum Hub/Authority Covers

For most topics, the collections will consist of hundreds of logical sites. Sometimes the user may want to filter the information space based on various criteria. Obviously, the hubs and authorities are some of the most important sites for the collection. So one option is to show the top n or n% hubs and authorities as the important sites, where n is any integer. However, instead of choosing an arbitrary integer, in some situations other techniques might be more appropriate. In this section we will define two techniques for filtering the information space.

We define a hub cover for a site graph with V sites and E links as a subset V_h of V, such that for all links (u,v) in E, u (the source of the link) belongs to V_h. An optimum hub cover is the hub cover of the smallest size for the given graph. That is, the optimum hub cover of a collection is the smallest number of sites that have links to all the sites of the collection. Filtering the collection by showing only the optimum hub cover is useful, because from these sites the user can reach all the sites of the collection.

On the other hand, the authority cover for a site graph with V sites and E links is a subset V_a of V, such that for all links (u,v) in E, v (the destination of the link) belongs to V_a. The optimum authority cover is the authority cover with the smallest size in the site graph. That is, the optimum authority cover of a collection is the smallest number of sites that have links from all the sites of the collection. Obviously, filtering the collection by the optimum authority cover is also useful.

Determining the optimum hub and authority covers for a site graph is similar to the vertex cover problem [10]. The vertex cover problem for an undirected graph G=(V,E) is to determine the smallest possible subset V' of V such that if (u,v) is in E, then at least one of u or v is in V'. Unfortunately, the vertex cover problem is NP-complete [10].

In WTMS we determine the approximate optimum hub and authority covers. The algorithm to determine the approximate optimum hub cover is as follows:

Let G=(V,E) be the site graph
V_h = ()
Let V_s be the nodes of the graph sorted by their increasing in-degrees
Remove from V_s all nodes with no incoming links
While V_s is not empty,
- Let v be the next node of V_s
- Let u be the node with the highest hub score among the nodes that have links to v
- Add u to V_h
- Remove from V_s, u as well as all nodes that have links from u
Return V_h as the approximate optimum hub cover

The algorithm examines the nodes of the graph sorted by their increasing in-degress. For nodes having just one in-link, the source of the link has to be added to the hub cover. For nodes having more than one link, the algorithm adds to the hub cover the link source with the highest hub score. Whenever a node is added to the hub cover, all nodes that have links from that node can be ignored.

Notice that even though we remove the sites with no in-links before the while loop in the above algorithm, these sites can still be in a hub cover. However, sites with both no in-links and out-links will be ignored by the algorithm. Since these sites are in the fringes of the collection (as discussed in Section 5.1), they should not be considered important to the collection.

The approximate optimum authority cover can be determined by a similar algorithm. We can also determine the hub and authority covers for the Web pages by applying the algorithms on the original graph of the collection.

Collections	Information Visualization	World Cup Soccer	Titanic
Number of Sites	305	140	528
Size of Approximate Optimal Hub Cover	35	20	52
Size of Approximate Optimal Authority Cover	28	18	68

Table 2: Statistics for determining approximate optimum hub and authority covers

Table 2 shows the size of the approximate optimal hub and authority covers that were discovered for the collections. In all cases we could discover a few sites, significantly smaller than the total number of sites, which have links to/from all the sites of the collections.


(a) Hub Cover	(b) Authority Cover	(c) Nodes directly reachable from the main hub www.anancyweb.com

Figure 11: The Hub and Authority Covers for the World Cup Soccer collection

Figure 11(a) shows the approximate optimal hub cover for the World Cup Soccer collection, sorted by the hub scores. Similarly, Figure 11(b) shows the approximate optimal authority cover sorted by the authority scores. These views are useful because starting from these sites the user can visit all the sites of the collection. Clicking on one of the sites in the Hub cover view shows the sites that can be directly reached from the selected site. Thus, Figure 11(c) shows the sites with links from the main hub for the World Cup Soccer collection www.anancyweb.com. Similarly, clicking on a site in the Authority cover view shows the sites that have links to the selected site.

6. Conclusion

In this paper, we have presented WTMS, a Web Topic Management system. WTMS uses a crawler to gather information related to a particular topic from the WWW. The enhancements to the crawler has resulted in improved performance by reducing the number of pages that need to be downloaded by more than 20% while missing only a few insignificant pages.

The WTMS interface provides various visualizations to navigate through the information space at different levels of abstraction. For example, the user can see the details of a logical site or a Web page. The system allows the user to integrate searching and browsing. Besides traditional keyword-based search, structural analysis techniques are provided which allow the user to gain several useful insights about the collection. This kind of analysis is not easy using traditional search engines or the previous systems for topic management. We have also introduced the concept of optimum hub and authority covers as a technique to filter the information space. The example scenarios presented in the paper indicates the usefulness of the system for Web Topic Management.

Future work is planned along various directions:

Incorporating Other Analysis Techniques: We need to integrate other techniques of analysis to enable the users to understand the information space from different perspectives. For example, we may determine the cores or bi-partite sub graphs in the site graph to understand the communities within the collection as discussed in [14]. Techniques that examine the contents of the collection are also required.
Allowing Collaboration: We plan to allow distributed users to collaborate over the internet while navigating through the same information space. For example, colleagues can collaborate while performing literature survey on a particular topic. Our client-server architecture will facilitate such a collaborative environment.
Usability Studies: Although the initial reactions to the system has been positive, more formal user studies are required to determine the effectiveness of the WTMS system for analyzing information about a particular topic. Such feedback will help us to improve the system.

We believe that as the WWW grows bigger, systems like the WTMS will become essential for retrieving useful information.

References

B. Amento, W. Hill, L. Terveen, D. Hix, and P. Ju.
An Empirical Evaluation of User Interfaces for Topic Management of Web Sites.
In Proceedings of the ACM SIGCHI '99 Conference on Human Factors in Computing Systems, pages 552-559, Pittsburgh, Pa, May 1999.
K. Andrews.
Visualizing Cyberspace: Information Visualization in the Harmony Internet Browser.
In Proceedings of the 1995 Information Visualization Symposium, pages 97-104, Atlanta, GA, 1995.
I. Ben-Shaul, M. Hersovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalheim, V. Soroka, and S. Ur.
Adding Support for Dynamic and Focussed Search with Fetuccino.
In Proceedings of the Eight International World-Wide Web Conference, pages 575-588, Toronto, Canada, May 1999.
K. Bharat and A. Broder.
A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines .
Computer Networks and ISDN Systems. Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April 1998.
K. Bharat and M. Henzinger.
Improved Algorithms for Topic Distillation in a Hyperlinked Environment.
In Proceedings of the ACM SIGIR '98 Conference on Research and Development in Information Retrieval, pages 104-111, Melbourne, Australia, August 1998.
S. Card, G. Robertson, and W. York.
The WebBook and the Web Forager: An Information Workspace for the World-Wide Web.
In Proceedings of the ACM SIGCHI '96 Conference on Human Factors in Computing Systems, pages 112-117, Vancouver, Canada, April 1996.
J. Carriere and R. Kazman.
Searching and Visualizing the Web through Connectivity.
In Proceedings of the Sixth International World-Wide Web Conference, pages 701-711, Santa Clara, CA, April 1997.
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan.
Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text.
Computer Networks and ISDN Systems. Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April 1998.
S. Chakrabarti, M. van den Berg, and B. Dom.
Focussed Crawling: a New Approach to Topic-specific Web Resource Discovery.
In Proceedings of the Eight International World-Wide Web Conference, pages 545-562, Toronto, Canada, May 1999.
T. Cormen, C. Leiserson, and R. Rivest.
Introduction to Algorithms.
The MIT Press, 1992.
R. Hendley, N. Drew, A. Wood, and R. Beale.
Narcissus: Visualizing Information.
In Proceedings of the 1995 Information Visualization Symposium, pages 90-96, Atlanta, GA, 1995.
M. Hersovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalheim, and S. Ur.
The Shark-Search Algorithm - an Application: Tailored Web Site Mapping.
Computer Networks and ISDN Systems. Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April 1998.
J. Kleinberg.
Authoritative Sources in a Hyperlinked Environment.
In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, May 1998.
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.
Trawling the Web for Emerging Cyber-communities.
In Proceedings of the Eight International World-Wide Web Conference, pages 403-415, Toronto, Canada, May 1999.
Y. Maarek and I. Shaul.
WebCutter: A System for Dynamic and Tailorable Site Mapping.
In Proceedings of the Sixth International World-Wide Web Conference, pages 713-722, Santa Clara, CA, April 1997.
S. Mukherjea and J. Foley.
Visualizing the World-Wide Web with the Navigational View Builder.
Computer Networks and ISDN Systems. Special Issue on the Third International World-Wide Web Conference, Darmstadt, Germany, 27(6):1075-1087, April 1995.
S. Mukherjea and Y. Hara.
Focus+Context Views of World-Wide Web Nodes.
In Proceedings of the Eight ACM Conference on Hypertext, pages 187-196, Southampton, England, April 1997.
S. Mukherjea
Organizing Topic-Specific Web Information.
To appear in Proceedings of the Eleventh ACM Conference on Hypertext, San Antonio, TX, May 2000.
G. Salton and M. McGill.
Introduction to Modern Information Retrieval.
McGraw-Hill, 1983.
L. Terveen and H. Will.
Finding and Visualizing Inter-site Clan Graphs.
In Proceedings of the ACM SIGCHI '98 Conference on Human Factors in Computing Systems, pages 448-455, Los Angeles, Ca, April 1998.

Vitae

Sougata Mukherjea received his Bachelors in Computer Science & Engineering from Jadavpur University, Calcutta, India in 1988 and MS in Computer Science from Northeastern University, Boston, Ma in 1991. He then obtained his PhD in Computer Science from Georgia Institute of Technology, Atlanta, Ga in 1995. Till 1999 he was working at NEC's C&C Research Laboratories in San Jose, Ca. Presently he is working for Icarian, a start-up company in Sunnyvale, Ca. His main research interests are in Web Technologies like crawling and link analysis, Information Retrieval and Informtion Visualization. He can be reached by e-mail at [email protected] or [email protected].

你可能感兴趣的:(System)

[论文阅读] 软件工程 | 需求工程中领域知识研究：系统映射与创新突破张较瘦_ 前沿技术论文阅读软件工程
需求工程中领域知识研究：系统映射与创新突破论文信息DomainKnowledgeinRequirementsEngineering:ASystematicMappingStudyarXiv:2506.20754DomainKnowledgeinRequirementsEngineering:ASystematicMappingStudyMarinaAraújo,JúliaAraújo,RomeuO
android15 修改默认亮度及关闭自动亮度调节 darling_user Android android
默认亮度1、先把设备亮度手动调整到对应亮度2、通过adb命令获取对应亮度值adbshellsettingsgetsystemscreen_brightness3、代码上修改默认亮度值SYS/frameworks/base/core/res/res/values/config.xml-102+255SYS/frameworks/base/packages/SettingsProvider/res/v
深入理解提示词工程：原理、分类与实战应用小雷FansUnion AI2025 人工智能
一、什么是提示词工程（PromptEngineering）提示词工程是指通过设计和优化与大模型（如ChatGPT、文心一言等）交互的“提示词（Prompt）”，以获得更准确、更高效、更符合预期的模型输出结果的技术和方法。它是大模型应用开发中的核心环节。二、提示词的主要类型1.系统提示词（SystemPrompt）由开发者或系统设定，通常在对话开始时就注入，定义AI的角色、行为边界、风格、输出格式等
c# 利用 GZipStream 压缩解压缩文件（所有类型的文档) 山海上的风 c#
c#利用GZipStream压缩解压缩文件（所有类型的文档)usingSystem;usingSystem.Collections.Generic;usingSystem.IO;usingSystem.IO.Compression;usingSystem.Linq;usingSystem.Text;usingSystem.Threading.Tasks;namespaceGZipStream_压缩
ADIOS2 介绍与使用指南东北豆子哥 HPC/MPI HPC
文章目录ADIOS2介绍与使用指南什么是ADIOS2?ADIOS2的主要特点ADIOS2核心概念ADIOS2安装Linux系统安装Windows安装ADIOS2基本使用C++示例Python示例ADIOS2高级特性并行I/O流模式ADIOS2引擎类型性能优化建议总结ADIOS2介绍与使用指南什么是ADIOS2?ADIOS2(AdaptableInputOutputSystemversion2)是一
Maven 如何引入外部依赖jar包
1、在src目录下创建libs目录，并将需要引入的jar包放到lib目录下2、然后添加以下依赖到pom.xml文件中com.cryptoFrontcryptoFrontsystem1.0.0${project.basedir}/libs/cryptofront-2.1.8.jar3、点击idea中项目结构3、选择库，点击新建项目库，找到libs位置添加并应用保存4、此时看到这里就可以看出jar包就
PyTorch study notes[4]
文章目录thesystemofequationsreferencesthesystemofequationsthedefinitionofmatrixwithmathematicalform.thefollowingsamplecodeexpressesthemaxtrixandsquarematrix.importtorch#从Python列表创建矩阵matrix=torch.tensor([[
Android 跨进程通信(IPC)深度技术总结 JT-Blink Android android
1.概述Android系统基于Linux内核，采用多进程架构设计。每个Android应用默认运行在独立的进程中，拥有独立的虚拟机实例和内存空间。进程间的内存隔离机制保证了系统的稳定性和安全性，但同时也带来了进程间通信的挑战。1.1为什么需要跨进程通信系统架构需求：Android系统服务（如ActivityManagerService、WindowManagerService）运行在system_s
VB.NET Socket TCP服务器和客户端 DonovanZxq PC VB.NET SOCKET c#websocket tcp/ip
多线程,1服务器,多客户端可以有多个客户端连入服务器，服务器对所有客户端群发。模拟实验使用场景：多个客户端申请服务器TCP连接,服务器把自己的数据，比如压力，温度等发送给所有的客户端（比如工程师站，现场监控屏幕等）服务器：FORM代码ImportsSystem.TextPublicClassForm1PrivatemessageAsStringPrivateWithEventsmodbusTcpS
JAVA LIST＜Long＞快速转LIST＜String＞ LeeShaoQing java 学习 java
偶然间发现一个问题，获取List传给前端，拿到的值最后两位变成了00。这是因为当Long过长时，到前端数据拉取后几位可能会自动变成0，所以要先处理成String发给前端。ListbindingList=systemSiteExpensesConfigService.getBindingServiceType(bindingServiceTypeDTO);Liststrings=bindingLis
C#使用ExcelDataReader高效读取excel文件写入数据库香煎三文鱼 .net core .Net6 C#C#读取excel
分享一个库ExcelDataReader，它专注读取、支持.xls/.xlsx、内存优化。首先安装NuGet包dotnetaddpackageExcelDataReaderdotnetaddpackageSystem.Text.Encoding.CodePages编码内存优化：每次仅读取一行，适合处理百万级数据。类型安全方法：可用GetString(0)、GetDouble(1)等强类型方法（需确
java 学习底层代码算法好学且牛逼的马 java
#33写算法题黑马的视频争取简单的过一遍要考试啦密码的写底层代码秘密的底层代码有点长啊看不懂难找了几个视频课看看吧想看中文版jdkapi吧算了慢慢看先把几个顶级父类给看会了objectsystemstringstringbuilder算法单路递归packagecom.itheima.Recursion;publicclasssingleRecursion{ publicstaticvoidma
Cadence Design Systems EDA介绍（五）--Innovus 小蘑菇二号笔记
目录Innovus的主要功能1.初始布局规划（Floorplanning）2.详细布局（Placement）3.布线（Routing）4.时序分析与优化（TimingAnalysisandOptimization）5.功耗分析与优化（PowerAnalysisandOptimization）6.面积优化（AreaOptimization）7.签核（Sign-off）Innovus的特点1.高性能2
C# 与串口通信：解决常见问题的调试技巧与实用建议威哥说编程 c#单片机 stm32
串口通信作为一种经典的通信方式，在很多领域中仍然广泛应用，尤其是在嵌入式系统、工业自动化、测控系统等场景中。通过串口接口，可以实现设备间的短距离、低速数据传输。C#提供了强大的System.IO.Ports.SerialPort类来支持串口通信的开发，但在实际开发中，开发者常常遇到一些问题，比如数据丢失、串口冲突、波特率不匹配等。本文将深入探讨如何使用C#进行串口通信，结合调试技巧和实用建议，帮助
Docker+Portainer 离线安装 qq_30024063 docker 容器运维
1.Docker安装步骤一：官网下载docker安装包步骤二：解压安装包;tar-zxvfdocker-24.0.6.tgz步骤三：将解压之后的docker文件移到/usr/bin目录下;cpdocker/*/usr/bin/步骤四：将docker注册成系统服务;vim/etc/systemd/system/docker.service然后在文件中添加以下内容，退出并保存（:wq!）[Unit]D
如何用Docker部署Mysql 小楠小楠小楠 docker mysql 容器
1.安装Docker确保已安装Docker，并启动Docker服务。Linux：bash复制sudoaptupdatesudoaptinstalldocker.iosudosystemctlstartdockersudosystemctlenabledocker2.拉取MySQL镜像从DockerHub拉取官方MySQL镜像。bash复制dockerpullmysql:latest3.启动MySQ
C++程序实现阻止屏保、阻止系统自动关闭屏幕、阻止系统待机（附源码） dvlinker C/C++实战专栏阻止屏保阻止系统自动关闭屏幕阻止系统待机 API Monitor
目录1、概述2、设置屏幕保护程序，修改自动关闭显示器和待机的时间2.1、设置屏保程序2.2、修改自动关闭显示器和待机的时间3、通过屏保的通知消息来阻止屏保4、调用API函数SystemParametersInfo关闭/启用屏保，但存在问题4.1、初步确定处理策略4.2、启动监控进程去监控主进程4.3、系统强行关机的情况无法处理5、使用APIMonitor监测到目标程序对API的调用，找到了问题的突
微信小程序防录频截屏ios+android Yannnnnm ios 微信小程序 android
ios截屏暂未解决安卓录频截屏代码可以写在onshow中//安卓防止截屏录屏if(/android/i.test(wx.getSystemInfoSync().system)&&wx.setVisualEffectOnCapture){wx.setVisualEffectOnCapture({visualEffect:'hidden',complete:function(res){//wx.sho
Excel数据导出小记焚城记录总结 EXCEL插件 excel .net
文章目录前言一、DataTable=>EXCEL二、DBReader=>Excel（NPOI）三、分页查询DbReader=>Excel(MiniExcel)总结：前言最近经历了一次数据量比较大的导出，也做了各种优化尝试，这里稍记录一下一、DataTable=>EXCELusingSystem;usingSystem.Collections.Generic;usingSystem.IO;using
中国计算机学会（CCF）推荐学术会议-C（软件工程/系统软件/程序设计语言）：FPT 2025 爱思德学术 AI编程极限编程重构
FPT2025FPTisthepremierconferenceintheAsia-Pacificregiononfield-programmabletechnologies,reconfigurablecomputingdevicesandsystems.Field-programmabledevicesoffertheflexibilityofsoftwarewiththeperformanc
2019 CCF 推荐国际学术期刊&会议（计算机体系结构/并行与分布计算/存储系统）漓艾初 CCF
中国计算机学会推荐国际学术期刊&会议直接去这里找，全部都有https://www.ccf.org.cn/Academic_Evaluation/By_category/计算机体系结构/并行与分布计算/存储系统期刊A类序号刊物简称刊物全称出版社网址1TOCSACMTransactionsonComputerSystemsACMhttp://dblp.uni-trier.de/db/journals/
Wpf之命名空间！ weixin_44710358 Wpf wpf c#开发语言
文章目录前言一、命名空间二、命名空间讲解总结前言Wpf之命名空间！一、命名空间我们的程序中有许多的命名空间，例如一个程序中有Window类–Window类可能是指System.Windows.Window类,也可能是指位于第三方组件中的Window类，或您自己在应用程序中定义的Window类等。为了弄清你实际使用的是哪个类，XAML解析器会检查应用于元素的XML名称空间。二、命名空间讲解第一行代码
IntelliJ IDEA 路径问题总结：如何配置并显示当前工作目录 2301_79306982 开发语言 java intellij-idea
问题一：如何查看和配置IntelliJIDEA的工作目录工作目录（WorkingDirectory）决定了相对路径的起点当前工作目录究竟是什么？如何在IntelliJIDEA中验证和配置工作目录？解决方法通过代码显示当前工作目录使用以下代码打印运行时的工作目录：System.out.println("Currentworkingdirectory:"+System.getProperty("use
力扣代码错题记录 0319zz leetcode 算法
1.排序1.1快排不稳定的排序方法：[5,3A,6,3B]->[3B,3A,5,6]publicclassMain{publicstaticvoidmain(String[]args){Scannerscanner=newScanner(System.in);Stringstring=scanner.nextLine();String[]split=string.trim().split("");
Process.Start 方法的五个重载小柚子~ C#c#
一、system.Diagnostics.Product.start方法主要用来启动一个进程资源，例如：打开一个exe，根据路径打开一个文件夹，打开一个txt，我工作项目中主要是用来打开exe。二、start重载微软提供了5个方法重载Process.Start方法(System.Diagnostics)|MicrosoftDocs：1、Start(stringfilename,stringargu
【数字IC前端笔试真题精刷（2022.7.28）】芯动——数字IC验证工程师（1号卷-验证） ReRrain #数字IC 笔试
声明：本专栏所收集的数字IC笔试题目均来源于互联网，仅供学习交流使用。如有侵犯您的知识产权，请及时与博主联系，博主将会立即删除相关内容。笔试时间：2022-7-28；题目类型：不定项（10x1’=10’）【错选不得分，少选得1/3分】问答（9x10’=90’）文章目录不定项1、(单选)在verilog语言中，a=4'b1011，那么&a=()2、(单选)SystemVerilog中类默认的成员属性
Mac python3.12 执行pip/pip3异常externally-managed-environment 翱翔的赖思 macos pip python
环境：Mac、Python3.12.x版本（3.12.5）问题：执行pipinstallxyz后出现异常：error:externally-managed-environment×Thisenvironmentisexternallymanaged╰─>ToinstallPythonpackagessystem-wide,trybrewinstallxyz,wherexyzisthepackage
谷歌地图的3d街景使用的是什么数据格式？奇树谦 experience 3d 三维显示
文章目录一、3D街景（StreetView）1.图像部分2.元数据（Metadata）️二、3D城市模型（GoogleEarth或Maps的倾斜摄影模型）1.模型部分2.瓦片划分（TilingSystem）3.材质贴图注意与标准格式对比（参考）✅一、Google3DMesh使用的格式（Protobuf+Binary）1.**数据结构**2.**典型组成**✅二、glTF（GLTransmissio
ROS2 强化学习：案例与代码实战芯动大师 ROS2学习目标检测人工智能
一、引言在机器人技术不断发展的今天，强化学习（RL）作为一种强大的机器学习范式，为机器人的智能决策和自主控制提供了新的途径。ROS2（RobotOperatingSystem2）作为新一代机器人操作系统，具有更好的实时性、分布式性能和安全性，为强化学习在机器人领域的应用提供了更坚实的基础。本文将通过一个具体案例，深入探讨ROS2与强化学习的结合应用，并提供相关代码实现。二、案例背景本案例以移动机器
95.mysql5.7/MySQL8.0root密码忘记重置戒掉贪嗔痴数据库运维-MySQL mysql
1.mysql5.7密码重置方法mysql5.7--加参数vimy.cnf[mysqld]skip-grant-tables--重启systemctlrestartmysqld--修改密码USEmysql;FLUSHPRIVILEGES;UPDATEuserSETauthentication_string=PASSWORD('S3#1234')WHEREUser='root';EXIT;--注释v
ASM系列四利用Method 组件动态注入方法逻辑 lijingyao8206 字节码技术 jvm AOP 动态代理 ASM
这篇继续结合例子来深入了解下Method组件动态变更方法字节码的实现。通过前面一篇，知道ClassVisitor 的visitMethod()方法可以返回一个MethodVisitor的实例。那么我们也基本可以知道，同ClassVisitor改变类成员一样，MethodVIsistor如果需要改变方法成员，注入逻辑，也可以
java编程思想 --内部类百合不是茶 java 内部类匿名内部类
内部类;了解外部类并能与之通信内部类写出来的代码更加整洁与优雅 1,内部类的创建内部类是创建在类中的 package com.wj.InsideClass; /* * 内部类的创建 */ public class CreateInsideClass { public CreateInsideClass(
web.xml报错 crabdave web.xml
web.xml报错 The content of element type "web-app" must match "(icon?,display- name?,description?,distributable?,context-param*,filter*,filter-mapping*,listener*,servlet*,s
泛型类的自定义麦田的设计者 java android 泛型
为什么要定义泛型类，当类中要操作的引用数据类型不确定的时候。采用泛型类，完成扩展。例如有一个学生类 Student{ Student(){ System.out.println("I'm a student....."); } } 有一个老师类
CSS清除浮动的4中方法 IT独行者 JavaScript UI css
清除浮动这个问题，做前端的应该再熟悉不过了，咱是个新人，所以还是记个笔记，做个积累，努力学习向大神靠近。CSS清除浮动的方法网上一搜，大概有N多种，用过几种，说下个人感受。 1、结尾处加空div标签 clear:both 1 2 3 4 .div 1 { background : #000080 ; border : 1px s
Cygwin使用windows的jdk 配置方法 _wy_ jdk windows cygwin
1.[vim /etc/profile] JAVA_HOME="/cgydrive/d/Java/jdk1.6.0_43" (windows下jdk路径为D:\Java\jdk1.6.0_43) PATH="$JAVA_HOME/bin:${PATH}" CLAS
linux下安装maven 无量 maven linux 安装
Linux下安装maven(转) 1.首先到Maven官网下载安装文件，目前最新版本为3.0.3，下载文件为 apache-maven-3.0.3-bin.tar.gz，下载可以使用wget命令； 2.进入下载文件夹，找到下载的文件，运行如下命令解压 tar -xvf apache-maven-2.2.1-bin.tar.gz 解压后的文件夹
tomcat的https 配置,syslog-ng配置 aichenglong tomcat http跳转到https syslong-ng配置 syslog配置
1) tomcat配置https,以及http自动跳转到https的配置 1)TOMCAT_HOME目录下生成密钥(keytool是jdk中的命令) keytool -genkey -alias tomcat -keyalg RSA -keypass changeit -storepass changeit
关于领号活动总结 alafqq 活动
关于某彩票活动的总结具体需求，每个用户进活动页面，领取一个号码，1000中的一个；活动要求 1，随机性，一定要有随机性； 2，最少中奖概率，如果注数为3200注，则最多中4注 3，效率问题，（不能每个人来都产生一个随机数，这样效率不高）； 4，支持断电（仍然从下一个开始），重启服务；（存数据库有点大材小用，因此不能存放在数据库）解决方案 1，事先产生随机数1000个，并打
java数据结构冒泡排序的遍历与排序百合不是茶 java
java的冒泡排序是一种简单的排序规则冒泡排序的原理：比较两个相邻的数，首先将最大的排在第一个，第二次比较第二个，此后一样；针对所有的元素重复以上的步骤，除了最后一个例题；将int array[]
JS检查输入框输入的是否是数字的一种校验方法 bijian1013 js
如下是JS检查输入框输入的是否是数字的一种校验方法： <form method=post target="_blank"> 数字：<input type="text" name=num onkeypress="checkNum(this.form)"><br> </form>
Test注解的两个属性：expected和timeout bijian1013 java JUnit expected timeout
JUnit4：Test文档中的解释：　　The Test annotation supports two optional parameters. 　　The first, expected, declares that a test method should throw an exception. 　　If it doesn't throw an exception or if it
[Gson二]继承关系的POJO的反序列化 bit1129 POJO
父类 package inheritance.test2; import java.util.Map; public class Model { private String field1; private String field2; private Map<String, String> infoMap
【Spark八十四】Spark零碎知识点记录 bit1129 spark
1. ShuffleMapTask的shuffle数据在什么地方记录到MapOutputTracker中的 ShuffleMapTask的runTask方法负责写数据到shuffle map文件中。当任务执行完成成功，DAGScheduler会收到通知，在DAGScheduler的handleTaskCompletion方法中完成记录到MapOutputTracker中
WAS各种脚本作用大全 ronin47 WAS 脚本
　　　http://www.ibm.com/developerworks/cn/websphere/library/samples/SampleScripts.html 　　　无意中，在WAS官网上发现的各种脚本作用，感觉很有作用，先与各位分享一下　　　获取下载这些示例 jacl 和 Jython 脚本可用于在 WebSphere Application Server 的不同版本中自
java-12.求 1+2+3+..n不能使用乘除法、 for 、 while 、 if 、 else 、 switch 、 case 等关键字以及条件判断语句 bylijinnan switch
借鉴网上的思路，用java实现： public class NoIfWhile { /** * @param args * * find x=1+2+3+....n */ public static void main(String[] args) { int n=10; int re=find(n); System.o
Netty源码学习-ObjectEncoder和ObjectDecoder bylijinnan java netty
Netty中传递对象的思路很直观： Netty中数据的传递是基于ChannelBuffer（也就是byte[]）；那把对象序列化为字节流，就可以在Netty中传递对象了相应的从ChannelBuffer恢复对象，就是反序列化的过程 Netty已经封装好ObjectEncoder和ObjectDecoder 先看ObjectEncoder ObjectEncoder是往外发送
spring 定时任务中cronExpression表达式含义 chicony cronExpression
一个cron表达式有6个必选的元素和一个可选的元素，各个元素之间是以空格分隔的，从左至右，这些元素的含义如下表所示：代表含义是否必须允许的取值范围 &nb
Nutz配置Jndi ctrain JNDI
1、使用JNDI获取指定资源： var ioc = { dao : { type :"org.nutz.dao.impl.NutDao", args : [ {jndi :"jdbc/dataSource"} ] } } 以上方法,仅需要在容器中配置好数据源,注入到NutDao即可.
解决 /bin/sh^M: bad interpreter: No such file or directory daizj shell
在Linux中执行.sh脚本，异常/bin/sh^M: bad interpreter: No such file or directory。分析：这是不同系统编码格式引起的：在windows系统中编辑的.sh文件可能有不可见字符，所以在Linux系统下执行会报以上异常信息。解决： 1）在windows下转换：利用一些编辑器如UltraEdit或EditPlus等工具
[转]for 循环为何可恨？ dcj3sjt126com 程序员读书
Java的闭包(Closure)特征最近成为了一个热门话题。一些精英正在起草一份议案，要在Java将来的版本中加入闭包特征。然而，提议中的闭包语法以及语言上的这种扩充受到了众多Java程序员的猛烈抨击。不久前，出版过数十本编程书籍的大作家Elliotte Rusty Harold发表了对Java中闭包的价值的质疑。尤其是他问道“for 循环为何可恨？”[http://ju
Android实用小技巧 dcj3sjt126com android
1、去掉所有Activity界面的标题栏　　修改AndroidManifest.xml 　　在application 标签中添加android:theme="@android:style/Theme.NoTitleBar" 2、去掉所有Activity界面的TitleBar 和StatusBar 　　修改AndroidManifes
Oracle 复习笔记之序列 eksliang Oracle 序列 sequence Oracle sequence
转载请出自出处：http://eksliang.iteye.com/blog/2098859 1.序列的作用序列是用于生成唯一、连续序号的对象一般用序列来充当数据库表的主键值 2.创建序列语法如下： create sequence s_emp start with 1 --开始值 increment by 1 --増长值 maxval
有“品”的程序员 gongmeitao 工作
完美程序员的10种品质　　完美程序员的每种品质都有一个范围，这个范围取决于具体的问题和背景。没有能解决所有问题的完美程序员（至少在我们这个星球上），并且对于特定问题，完美程序员应该具有以下品质：　　1. 才智非凡- 能够理解问题、能够用清晰可读的代码翻译并表达想法、善于分析并且逻辑思维能力强（范围：用简单方式解决复杂问题）　　
使用KeleyiSQLHelper类进行分页查询 hvt sql .net C#asp.net hovertree
本文适用于sql server单主键表或者视图进行分页查询，支持多字段排序。KeleyiSQLHelper类的最新代码请到http://hovertree.codeplex.com/SourceControl/latest下载整个解决方案源代码查看。或者直接在线查看类的代码：http://hovertree.codeplex.com/SourceControl/latest#HoverTree.D
SVG 教程（三）圆形，椭圆，直线天梯梦 svg
SVG <circle> SVG 圆形 - <circle> <circle> 标签可用来创建一个圆：下面是SVG代码： <svg xmlns="http://www.w3.org/2000/svg" version="1.1"> <circle cx="100" c
链表栈 luyulong java 数据结构
public class Node { private Object object; private Node next; public Node() { this.next = null; this.object = null; } public Object getObject() { return object; } public
基础数据结构和算法十：2-3 search tree sunwinner Algorithm 2-3 search tree
Binary search tree works well for a wide variety of applications, but they have poor worst-case performance. Now we introduce a type of binary search tree where costs are guaranteed to be loga
spring配置定时任务 stunizhengjia spring timer
最近因工作的需要，用到了spring的定时任务的功能,觉得spring还是很智能化的,只需要配置一下配置文件就可以了,在此记录一下，以便以后用到： //------------------------定时任务调用的方法------------------------------ /** * 存储过程定时器 */ publi
ITeye 8月技术图书有奖试读获奖名单公布 ITeye管理员活动
ITeye携手博文视点举办的8月技术图书有奖试读活动已圆满结束，非常感谢广大用户对本次活动的关注与参与。 8月试读活动回顾： http://webmaster.iteye.com/blog/2102830 本次技术图书试读活动的优秀奖获奖名单及相应作品如下（优秀文章有很多，但名额有限，没获奖并不代表不优秀）：《跨终端Web》 gleams：http