Adam坤

ArnetMiner: Extraction and Mining of Academic Social Networks

ABSTRACT

This paper addresses several key issues in the ArnetMiner system, which aims at extracting and mining academic social networks. Speciﬁcally, the system focuses on: 1) Extracting researcher proﬁles automatically from the Web; 2) Integrating the publication data into the network from existing digital libraries; 3) Modeling the entire academic network; and 4) Providing search services for the academic network. So far, 448,470 researcher proﬁles have been extracted using a uniﬁed tagging approach. We integrate publications from online Web databases and propose a probabilistic framework to deal with the name ambiguity problem. Furthermore, we propose a uniﬁed modeling approach to simultaneously model topical aspects of papers, authors, and publication venues. Search services such as expertise search and people association search have been provided based on the modeling results. In this paper,we describe the architecture and main features of the system. We also present the empirical evaluation of the proposed methods.

CategoriesandSubjectDescriptors

H.3.3 [Information Search and Retrieval]: Text Mining, Digital Libraries; H.2.8 [DatabaseManagement]: Database Applications

GeneralTerms

Algorithms, Experimentation

Keywords

Social Network, Information Extraction, Name Disambiguation, Topic Modeling, Expertise Search, Association Search

1. INTRODUCTION

Extraction and mining of academic social networks aims at providing comprehensive services in the scientiﬁc research ﬁeld. In an academic social network, people are not only interested in searching for different types of information(such as authors,conferences, and papers), but are also interested in ﬁnding semantics-based information (such as structured researcher proﬁles).

Many issues in academic social networks have been investigated and several systems have been developed (e.g., DBLP, CiteSeer, and Google Scholar). However, the issues were usually studied separately and the methods proposed are not sufficient for mining the entire academic network. Two reasons are as follows: 1) Lack of semantics-based information. The social information obtained from user-entered profiles or by extraction using heuristics is sometimes incomplete or inconsistent; 2) Lack of a unified approach to efficiently model the academic network. Previously, different types of information in the academic network were modeled individually, thus dependencies between them cannot be captured accurately. In this paper, we try to address the two challenges in novel approaches. We have developed an academic search system, called ArnetMiner (http://www.arnetminer.org). Our objective in this system is to answer the following questions: 1) how to automatically extract researcher profiles from the Web? 2) how to integrate the extracted information (e.g., researchers’ profiles and publications) from different sources? 3) how to model different types of information in a unified approach? and 4) how to provide powerful search services based on the constructed network?
(1) We extend the Friend-Of-A-Friend (FOAF) ontology [9] as the profile schema and propose a unified approach based on Conditional Random Fields to extract researcher profiles from the Web.
(2) We integrate the extracted researcher profiles and the crawled publication data from the online digital libraries. We propose a unified probabilistic framework for dealing with the name ambiguity problem in the integration.
(3) We propose three generative probabilistic models for simultaneously modeling topical aspects of papers, authors, and publication venues.
(4) Based on the modeling results, we implement several search services such as expertise search and association search. We conducted empirical evaluations of the proposed methods. Experimental results show that our proposed methods significantly outperform the baseline methods for dealing with the above issues.
Our contributions in this paper include: (1) a proposal of a unified tagging approach to researcher profile extraction, (2) a proposal of a unified probabilistic framework to name disambiguation, and (3) a proposal of three probabilistic topic models to simultaneously model the different types of information. The paper is organized as follows. In Section 2, we review the related work. In Section 3, we give an overview of the system. In Section 4, we present our approach to researcher profiling. In Section 5, we describe the probabilistic framework to name disambiguation. In Section 6, we propose three generative probabilistic models to model the academic network. Section 7 illustrates several search services provided in ArnetMiner based on the modeling
results. We conclude the paper in Section 8.

2. RELATED WORK

2.1 Person Profile Extraction
Several research efforts have been made for extracting person profiles. For example, Yu et al. [32] propose a two-stage extraction method for identifying personal information from resumes. The first stage segments a resume into different types of blocks and the second stage extracts the detailed information such as Address and Email from the identified blocks. However, the method formalizes the profile extraction as several separate steps and conducts extraction in a more or less ad-hoc manner.
A few efforts also have been placed on the extraction of contact information from emails or from the Web. For example, Kristjansson et al. [19] have developed an interactive information extraction system to assist the user to populate a contact database from emails.
In comparison, profile extraction consists of contact information extraction as well as other different subtasks.

2.2 Name Disambiguation

A number of approaches have been proposed to name disambiguation. For example, Bekkerman and McCallum [6] present two unsupervised methods to distinguish Web pages to different persons with the same name: one is based on the link structure of the Web pages and the other is based on the textural content. However,
the methods cannot incorporate the relationships between data.
Han et al. [15] propose an unsupervised learning approach using K-way spectral clustering. Tan et al. [27] propose a method for name disambiguation based on hierarchical clustering. However, this kind of methods cannot capture the relationships either. Two supervised methods are proposed by Han et al. [14]. For each given name, the methods learn a specific classification model from the training data and use the model to predict whether a new paper is authored by a specific author with the name. However, the methods are user-dependent. It is impractical to train thousands of models for all individuals in a large digital library.

2.3 Topic Modeling

Considerable work has been conducted for investigating topic models or latent semantic structures for text mining. For example, Hofmann [17] proposes the probabilistic latent semantic indexing (pLSI) and applies it to information retrieval (IR). Blei et al. [8] introduce a three-level Bayesian network, called Latent Dirichlet Allocation (LDA). The basic generative process of LDA closely resembles pLSI except that in pLSI, the topic mixture is conditioned on each document while in LDA, the topic mixture is drawn from a conjugate Dirichlet prior that remains the same for all documents.
Some other work has been conducted for modeling both author interests and document contents together. For example, the Author model [21] is aimed at modeling the author interests with a one-toone correspondence between topics and authors. The Author-Topic model [25] [26] integrates the authorship into the topic model and
can find a topic mixture over documents and authors.
Compared with the previous topic modeling work, in this paper, we propose a unified topic model to simultaneously model the topical aspects of different types of information in the academic network.

2.4 Academic Search

For academic search, several research issues have been intensively investigated, for example expert finding and association search. Expert finding is one of the most important issues for mining social networks. For example, both Nie et al. [24] and Balog et al.

[4] propose extended language models to address the expert finding problem. From 2005, Text REtrieval Conference (TREC) has provided a platform with the Enterprise Search Track for researchers to empirically assess their methods for expert finding [13].
Association search aims at finding connections between people. For example, the ReferralWeb [18] system helps people search and explore social networks on the Web. Adamic and Adar [1] have investigated the problem of association search in email networks.
However, existing work mainly focuses on how to find connections between people and ignores how to rank the found associations. In addition, a few systems have been developed for academic search such as, scholar.google.com, libra.msra.cn, citeseer.ist.psu, and Rexa.info. Though much work has been performed, to the best of our knowledge, the issues we focus on in this work (i.e., profile extraction, name disambiguation, and academic network modeling) have not been sufficiently investigated. Our system addresses all these problems holistically.

3. OVERVIEW OF ARNETMINER

Figure 1 shows the architecture of our ArnetMiner system. The system mainly consists of five main components:

Extraction: it focuses on extracting researcher profiles from the Web automatically. It first collects and identifies one’s homepage from the Web, then uses a unified approach to extract the profile properties from the identified document. It extracts publications from online digital libraries using rules.
Integration: it integrates the extracted researchers’ profiles and the extracted publications by using the researcher name as the identifier. A probabilistic framework has been proposed to deal with the name ambiguity problem in the integration. The integrated data is stored into a researcher network knowledge base (RNKB).
Storage and Access: it provides storage and index for the extracted/integrated data in the RNKB. Specifically, for storage it employs MySQL and for index, it employs the inverted file indexing method [3].
Modeling: it utilizes a generative probabilistic model to simultaneously model different types of information. It estimates a topic distribution for each type of information.
Search Services: based on the modeling results, it provides several search services: expertise search and association search. It also provides other services, e.g., author interest finding and academic suggestion (such as paper suggestion and citation suggestion).
It is challenging in many ways to implement these components. First, the previous extraction work has been usually conducted on a specific data set. It is not immediately clear whether such methods can be directly adapted to the global Web. Secondly, it is unclear how to deal with the disambiguation problem by making full use of the extracted information. For example, how to use the relationships between publications. Thirdly, there is no existing model that can simultaneously model the different types of information in the academic network. Finally, different strategies for modeling the academic network have different behaviors. It is necessary to study how different they are and which one would be the best for academic search.
Based on these considerations, for profile extraction, name disambiguation, and modeling, we propose new approaches to overcome the drawbacks that exist in the traditional methods. For storage and access, we utilize the classical methods, because these issues have been intensively investigated and the existing methods can result in good performance in our system.

4. RESEARCHER PROFILE EXTRACTION

4.1 Problem Definition

Profile extraction is the process of extacting the value of each property in a person profile. We define the schema of the researcher profile (as shown in Figure 2) by extending the FOAF ontology [9].
We perform a statistical study on randomly selected 1, 000 researchers from ArnetMiner and find that it is non-trivial to perform profile extraction from the Web. We observed that 85.62% of the researchers are faculty members from universities and 14.38% are from company research centers. For researchers from the same
company, they may share a template-based homepage. However, different companies have different templates. For researchers from universities, the layout and the content of their homepages vary largely. We have also found that 71.88% of the 1, 000 Web pages are researchers’ homepages and the rest are pages introducing the researchers. Characteristics of the two types of pages significantly differ from each other.
We also analyze the content of the Web pages and find that about 40% of the profile properties are presented in tables/lists and the others are presented in natural language text. This suggests a method without using global context information in the page would be ineffective. Statistical study also unveils that (strong) dependencies exist between different profile properties. For example, there are 1, 325 cases (14.54%) in our data of which the extraction needs to use the extraction results of other properties. An ideal method should consider processing all the subtasks holistically.

4.2 A Unified Approach to Profiling

4.2.1 Process

The proposed approach consists of three steps: relevant page identification, preprocessing, and extraction. In relevant page identification, given a researcher name, we first get a list of web pages by a search engine (we use the Google API) and then identify the homepage/introducing page using a binary classifier. We use Sup-port Vector Machines (SVM) [12] as the classification model and define features such as whether the title of the page contains the person name and whether the URL address (partly) contains the person name. The performance of the classifier is 92.39% by F1-
measure. In preprocessing, (a) we separate the text into tokens and (b) we assign possible tags to each token. The tokens form the basic units and the pages form the sequences of units in the tagging problem. In tagging, given a sequence of units, we determine the most likely corresponding sequence of tags by using a trained tagging model. Each tag corresponds to a property defined in Figure 2, e.g., ‘Position’. In this paper, we make use of Conditional Random Fields (CRFs) [20] as the tagging model. Next we describe the steps (a) and (b) in detail.

(a) We identify tokens in the Web page using heuristics. We
define five types of tokens: ‘standard word’, ‘special word’, ‘’ token, term, and punctuation mark. Standard words are unigram words in natural language. Special words include email, URL, date, number, percentage, words containing special terms (e.g. ‘Ph.D.’ and ‘.NET’), special symbols (e.g. ‘===’ and ‘###’). We identify special words by using regular expressions. ‘’ tokens (used for identifying person photos and email addresses) are ‘’ tags in the HTML file. Terms are base noun phrases extracted from the Web page by using a tool based on technologies proposed in [30].
(b) We assign tags to each token based on the token type. For example, for a standard word, we assign all possible tags corresponding to all properties. For a special word, we assign tags indicating Position, Affiliation, Email, Address, Phone, Fax, Bsdate, Msdate, and Phddate. For a ‘’ token, we assign two tags: Photo and Email, because an email address is sometimes shown as an image). After each token is assigned with several possible tags, we can perform most of the profiling tasks using the tags (extracting 19 properties defined in Figure 2).

4.2.2 CRF model and Features

We employ Conditional Random Fields (CRF) as the tagging model. CRF is a conditional probability of a sequence of tags given a sequence of observations [20]. For tagging, a trained CRF model is used to find the sequence of tags Y∗having the highest likelihood Y∗ = maxY P(Y |X). The CRF model is built with thelabeled data by means of an iterative algorithm based on Maximum Likelihood Estimation. Three types of features were defined in the CRF model: content features, pattern features, and term features. The features were defined for different kinds of tokens. Table 1 shows the defined features. We incorporate the defined features into the CRF model by defining Boolean-valued feature functions. Finally, 108,409 features were used in our experiments.

4.3 Profile Extraction Performance

For evaluating our profiling method, we randomly chose 1, 000 researcher names in total from our researcher network. We used the method described in Section 4.2.1 to find the researchers’ homepages or introducing pages. If the method cannot find a Web page for a researcher, we removed the researcher name from the data set. We finally obtained 898 Web pages (one for each researcher). Seven human annotators conducted annotation on the Web pages. A spec was created to guide the annotation process. On disagreements in the annotation, we conducted ‘majority voting’. In the
experiments, we conducted evaluations in terms of precision, recall, and F1-measure for each profile property. We defined baselines for profile extraction. We used the rule learning and the classification based approaches as baselines. For the former, we employed the Amilcare tool, which is based on a rule induction algorithm: LP2 [11]. For the latter, we trained a
classifier to identify the value of each property. We employed Support Vector Machines (SVM) [12] as the classification model. Experimental results show that our method results in a performance of 83.37% in terms of average F1-measure; while Amilcare and SVM result in 53.44% and 73.57%, respectively. Our method clearly outperforms the two baseline methods. We have also found that the performance of the unified method decreases (−11.28% by F1) when removing the transition features, which indicates that a unified approach is necessary for researcher profiling.
We investigated the contribution of each feature type in profile
extraction. We employed only content features, content+term features, content+pattern features, and all features to train the models and conducted the profile extraction. Figure 3 shows the average F1-scores of profile extraction with different feature types. The results unveil contributions of individual features in the extraction. We see that solely using one type of features cannot obtain accurate profiling results. Detailed evaluations can be found in [28].

5. NAME DISAMBIGUATION

5.1 Problem Definition

We integrate the publication data from the online database including DBLP bibliography, ACM Digital library, CiteSeer, and others. For integrating the researcher profiles and the publications, we use the researcher name and the publication author name as the identifier. The method inevitably has the ambiguity problem. We give a formal definition of the name disambiguation task in our context. Given a person name a, we denote all publications having the author name a as P = {p1, p2, · · · , pn}. Each publication pi has six attributes: paper title (pi.title), publication venue

We define five types of relationships between papers (Table 2). Relationship r1 represents two papers are published at the same venue. Relationship r2 means two papers have a secondary author with the same name, and relationship r3 means one paper cites the other paper. Relationship r4 indicates a constraint-based relationship supplied via user feedback. For instance, the user can specify that two specific papers should be assigned to a same person. We use an example to explain relationship r5. Suppose pi has authors ‘David Mitchell’ and ‘Andrew Mark’, and pj has authors ‘David
Mitchell’ and ‘Fernando Mulford’. We are to disambiguate ‘David Mitchell’. If ‘Andrew Mark’ and ‘Fernando Mulford’ also coauthor a paper, then we say pi and pj have a 2-CoAuthor relationship. In our currently experiments, we empirically set the weights of relationships w1 ∼ w5 as 0.2, 0.7, 0.3, 1.0, 0.7 τ. The publication data with relationships can be modeled as a graph comprising of nodes and edges. Each attribute of a paper is attached to the corresponding node as a feature vector. In the vector, we use words (after stop words filtering and stemming) in the attributes as features and use their numbers of occurrences as the values.

5.2 A Unified Probabilistic Framework

5.2.1 Formalization using HMRF

We propose a probabilistic framework based on Hidden Markov Random Fields (HMRF) [5], which can capture dependencies between observations (with each paper being viewed as an observation). The disambiguation problem is cast as assigning a tag to each paper with each tag representing an actual researcher. Specifically, we define a-posteriori probability as the objective function. We aims at maximizing the objective function. The five types of relationships are incorporated into the objective function. According to HMRF, the conditional distribution of the researcher labels y given the observations x (papers) is

where D(xi, yh) is the distance between paper xi and researcher yh and D(xi, xj )is the distance between papers xi and xj ; rk(xi, xj ) denotes a relationship between xi and xj ; wk is the weight of the relationship; and Z is a normalization factor.

5.2.2 EM framework

Three tasks are executed by the Expectation Maximization method: estimation of parameters in the distance measure, re-assignment of papers to researchers, and update of researcher representatives yh. We define the distance function D(xi, xj ) as follows:

here A is defined as a diagonal matrix, for simplicity. Each element in A denotes the weight of the corresponding feature in x. The EM process can be summarized as follows: in the E-step, given the researcher representatives, each paper is assigned to a researcher by maximizing P(y|x). In the M-step, the researcher representative yh is re-estimated from the assignments, and the distance measure is updated to maximize the objective function again.

The assignment of a paper is performed while keeping assignments of the other papers fixed. The assignment process is repeated after all papers are assigned. This process runs until no paper changes its assignment between two successive iterations. In the M-step, each researcher representative is updated by the

5.3 Name Disambiguation Performance

To evaluate our method, we created a data set that consists of 14 real person names (six are from the author’s lab and the others are from [31]). Statistics of this data set are shown in Table 3. Five human annotators conducted disambiguation for the names. A spec was created to guide the annotation process. The labeling work was carried out based on authors’ affiliations, emails, and publications on their homepages.
We defined a baseline based on the method from [27] except that [27] also utilizes a search engine to help the disambiguation. The method is based on hierarchical clustering. We also compared our approach with the DISTINCT method [31]. In all experiments, we
suppose that the number of persons k is provided empirically. Table 4 shows the results. We see that our method significantly outperforms the baseline method for name disambiguation (+10.75% in terms of the average F1-score). The baseline method suffers from two disadvantages: 1) it cannot take advantage of relationships between papers and 2) it relies on a fixed distance measure. Figure 4 shows the comparison results of our method and DISTINCT [31]. We used the person names evaluated in both [31] and our experiments for comparison. We see that for some names, our approach significantly outperforms DISTINCT (e.g., ‘Michael Wagner’); while for other names our approach underperforms DISTINCT (e.g. ‘Bin Yu’). We further investigated the contribution of each relationship type.
We first removed all relationships and then added them to our approach one by one: CoPubvenue, Citation, CoAuthor, and τ -CoAuthor. At each step, we evaluated the performance of our approach (cf. Figure 5). We see that without using the relationships the disambiguation performance drops sharply (−44.72% by F1) and by adding the relationships, improvements can be obtained at each step. This confirms us that a framework by integrating relationships for name disambiguation is worthwhile and each defined relationship in our method is helpful. We can also see that the CoAuthor relationship is the major contributor (+24.38% by F1).

6. MODELING ACADEMIC NETWORK

Modeling the academic network is critical to any searching or suggesting tasks. Traditionally, information is usually represented based on the ‘bag of words’ (BOW) assumption. The method tends to be overly specific in terms of matching words.

Recently, probabilistic topic models such as probabilistic Latent Semantic Indexing (pLSI) [17], Latent Dirichlet Allocation (LDA) [8], and Author-Topic model [25] [26] have been proposed as well as successfully applied to multiple text mining tasks such as information retrieval [29], collaborative filtering [8] [16], and paper reviewer finding [22]. However, these models are not sufficient to
model the whole academic network, as they cannot model topical aspects of all types of information in the academic network. We propose a unified topic model for simultaneously modeling the topical distribution of papers, authors, and conferences. For simplicity, we use conference to denote conference, journal, and book hereafter. The learned topic distribution can be used to further estimate the inter-dependencies between different types of information, e.g., the closeness between a conference and an author.

6.1 Our Proposed Topic Models

The proposed model is called Author-Conference-Topic (ACT) model. Three different strategies are employed to implement the topic model (as shown in Figure 6). In the first model (ACT1, Figure 6 (a)), each author is associated with a multinomial distribution over topics and each word in a paper and the conference stamp is generated from a sampled topic. In the second model (ACT2, Figure 6 (b)), each author conference pair is associated with a multinomial distribution over topics and
each word is then generated from a sampled topic.
In the third model (ACT3, Figure 6 ©), each author is associated with a topic distribution and the conference stamp is generated after topics have been sampled for all word tokens in a paper. The different implementations reduces the process of writing a scientific paper to different series of probabilistic steps. They have different behaviors in the academic applications. In the remainder
of this section, we will describe the three models in more detail.

6.2 ACT Model 1

In the first model (Figure 6(a)), the conference information is viewed as a stamp associated with each word in a paper. Intuition behind the first model is: coauthors of a paper determine topics written in this paper and each topic then generates the words and determines a proportion of the publication venue. The generative process can be summarized as follows:

posterior probability is calculated by the following:

6.3 ACT Model 2

In the second model (cf. Figure 6(b)), each topic is chosen from a multinomial topic distribution specific to an author-conference pair, instead of an author as that in ACT1. The model is derived from the observation: when writing a paper, coauthors usually first choose a publication venue and then write the paper based on themes of the publication venue and interests of the authors. The corresponding generative process is:

6.4 ACT Model 3

In the third model (cf. Figure 6©), the conference stamp is
taken as a numerical value. Each conference stamp of a paper is chosen after topics have been sampled for all word tokens in the paper. Intuitively, this corresponds to a natural way of publishing the scientific paper: authors first write a paper and then determine where to publish the paper based on the topics discussed in the paper. The corresponding generative process is:

In this model, the conference comes from a normal linear model. The covariates τ in this model are the frequencies of the topics in the document. The regression coefficients on these frequencies constitute η. The difference of parameterization from ACT1 is that the conference stamp is sampled from a normal linear distribution after topics were sampled for all word tokens in a paper.

7. ACADEMIC SEARCH SERVICES

7.1 Applying ACT Models to Expertise Search

In expertise search, the objective is to find the expertise authors, expertise papers, and expertise conferences for a given query.

7.1.1 Process

Based on the proposed models, we can calculate the likelihood of a paper generating a word using ACT1 as the example as following:

The likelihood of an author model and a conference model generating a word can be similarly defined. However, the learned topics by the LDA-style model is usually general and not specific to a given query. Therefore, only using ACT itself is too coarse for academic search [29]. Our preliminary experiments also show that employing only ACT or LDA models to information retrieval hurts the retrieval performance. In general, we would like to have a balance between generality and specificity. Therefore, we derive a combination of the ACT model and the word-based language model:

7.1.2 Expertise Search Performance

We collected a list of the most frequent queries from the log of ArnetMiner for evaluation. We conducted experiments on a subset of the data (including 14, 134 persons, 10, 716 papers, and 1, 434 conferences) from ArnetMiner. For evaluation, we used the method of pooled relevance judgments [10] together with human judgments. Specifically, for each query, we first pooled the top 30 results from three similar systems (Libra, Rexa, and ArnetMiner). Then, two faculty members and five graduate students from CS provided human judgments. Four-grade scores (3, 2, 1, and 0) were assigned respectively representing definite expertise, expertise, marginal expertise, and no expertise. Finally, the judgment scores were averaged to obtain the final score. In all experiments, we conducted evaluation in terms of P@5, P@10, P@20, R-pre, and mean average precision (MAP) [10] [13]. We used language model (LM), LDA [8], and the Author-Topic (AT) model [25] [26] as the baseline methods. For language model, we used Equation (17) to calculate the relevance between a query term and a document and similar equations for an author/conference
(an author is represented by his/her published papers and a conference is represented by papers published on it). For LDA, we used a similar equation to Equation (16) to calculate the relevance of a term and a document. For the AT model, we used similar equations to Equation (16) to calculate the relevance of a query term with a paper or an author. For the LDA and AT models, we performed model estimation with the same setting as that for the ACT models. We empirically set the number of topics as T = 80 for all models.

Table 5 shows five topics discovered by ACT1.
Table 6 shows the experimental results of retrieving papers, authors, and conferences using our proposed methods and the baseline methods. We see that our proposed three methods outperform the baseline methods. LDA only models documents and thus can support only paper search; while AT supports paper search and author search. Both models underperform our proposed unified models. Our models benefit from the ability of modeling all kinds of information holistically, thus can capture the dependencies between the different types of information. We can also see that ACT1 achieves the best performance in all evaluation measures. For comparison purposes, we also evaluate the results of two similar systems: Libra.msra.cn and Rexa.info. The average MAP obtained by Libra and Rexa on our data set are 48.3% and 45.0%. We see that our methods clearly outperform the two systems.

7.2 Applying ACT Models to Association Search

Association Search: Given a social network G = (V, E) and an association query (ai, aj ) (source person, target person), association search is to find and rank possible associations {αk(ai, aj )} from ai to aj . Each association is denoted as a referral chain of persons. There are two subtasks in association search: finding possible as sociations between two persons and ranking the associations. Given a large social network, to find all associations is an NP-hard problem. We instead focus on finding the ‘shortest’ associations. Hence, the problem becomes how to estimate the score of an association and one key issue is how to calculate the distance between persons. We use KL divergence to define the distance as:

We use the accumulated distance between persons on an association path as the score of the association. We call the association with the smallest score as the shortest association and our problem can be formalized as that of finding the near-shortest associations. Our approach consists of two stages:

7.3 Other Applications

Our model can support many other applications, e.g., author interest finding and academic suggestion.
For example, Table 7 shows top 5 words and top 5 authors associated to two conferences found by ACT1. Table 8 shows top 5 words and top 5 conferences associated to two researchers found by ACT1. The results can be directly used to characterize the conference themes and researcher interests. They can be also used for prediction/suggestion tasks. For example, one can use the model to find the best matching reviewer for a paper submitted to a specific conference. Previously, such work is fulfilled by only keyword matching or topic-based retrieval such as [22], but not considering the conference. One can also use the model to suggest a venue to submit a paper based on its content and authors’ interests. Or one can use it to suggest popular topics when authors prepare a paper for a conference.

8. CONCLUSION

In this paper, we describe the architecture and the main features of the ArnetMiner system. Specifically, we propose a unified tagging approach to researcher profiling. About a half million researcher profiles have been extracted into the system. The system has also integrated more than one million papers. We propose a probabilistic framework to deal with the name ambiguity problem in the integration. We further propose a unified topic model to simultaneously model the different types of information in the academic network. The modeling results have been applied to expertise search and association search. We conduct experiments for evaluating each of the proposed approaches. Experimental results indicate that the proposed methods can achieve a high performance.
There are many potential future directions of this work. It would be interesting to further investigate new extraction models for improving the accuracy of profile extraction. It would be also interesting to investigate how to determine the actual person number k for name disambiguation. Currently, the number is supplied manually, which is not practical for all author names. In addition, extending the topic model with link information (e.g., citation information) or time information is a promising direction.

9. ACKNOWLEDGMENTS

The work is supported by the National Natural Science Foundation of China (90604025, 60703059), Chinese National Key Foundation Research and Development Plan (2007CB310803), and Chinese Young Faculty Research Funding (20070003093). It is also supported by IBM Innovation funding.

10. REFERENCES

[1] L. A. Adamic and E. Adar. How to search a social network. Social Networks, 27:187–203, 2005.
[2] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine Learning, 50:5–43, 2003.

[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
ACM Press, 1999.
[4] K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert
finding in enterprise corpora. In Proc. of SIGIR’06, pages 43–55,
2006.
[5] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework
for semi-supervised clustering. In Proc. of KDD’04, pages 59–68,
2004.
[6] R. Bekkerman and A. McCallum. Disambiguating web appearances
of people in a social network. In Proc. of WWW’05, pages 463–470,
2005.
[7] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Proc. of
NIPS’07, 2007.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation.
Journal of Machine Learning Research, 3:993–1022, 2003.
[9] D. Brickley and L. Miller. Foaf vocabulary specification. In
Namespace Document, http://xmlns.com/foaf/0.1/, September 2004.
[10] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete
information. In Proc. of SIGIR’04, pages 25–32, 2004.
[11] F. Ciravegna. An adaptive algorithm for information extraction from
web-related texts. In Proc. of IJCAI’01 Workshop, August 2001.
[12] C. Cortes and V. Vapnikn. Support-vector networks. Machine
Learning, 20:273–297, 1995.
[13] N. Craswell, A. P. de Vries, and I. Soboroff. Overview of the
trec-2005 enterprise track. In TREC’05, pages 199–205, 2005.
[14] H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two
supervised learning approaches for name disambiguation in author
citations. In Proc. of JCDL’04, pages 296–305, 2004.
[15] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author
citations using a k-way spectral clustering method. In Proc. of
JCDL’05, pages 334–343, 2005.
[16] T. Hofmann. Collaborative filerting via gaussian probabilistic latent
semantic analysis. In Proc.of SIGIR’03, pages 259–266, 1999.
[17] T. Hofmann. Probabilistic latent semantic indexing. In Proc.of
SIGIR’99, pages 50–57, 1999.
[18] H. Kautz, B. Selman, and M. Shah. Referral web: Combining social
networks and collaborative filtering. Communications of the ACM,
40(3):63–65, 1997.
[19] T. Kristjansson, A. Culotta, P. Viola, and A. McCallum. Interactive
information extraction with constrained conditional random fields. In
Proc. of AAAI’04, 2004.
[20] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In
Proc. of ICML’01, 2001.
[21] A. McCallum. Multi-label text classification with a mixture model
trained by em. In Proc. of AAAI’99 Workshop, 1999.
[22] D. Mimno and A. McCallum. Expertise modeling for matching
papers with reviewers. In Proc. of KDD’07, pages 500–509, 2007.
[23] T. Minka. Estimating a dirichlet distribution. In Technique Report,
http://research.microsoft.com/ minka/papers/dirichlet/, 2003.
[24] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval.
In Proc. of WWW’07, pages 81–90, 2007.
[25] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The
author-topic model for authors and documents. In Proc. of UAI’04,
2004.
[26] M. Steyvers, P. Smyth, and T. Griffiths. Probabilistic author-topic
models for information discovery. In Proc. of SIGKDD’04, 2004.
[27] Y. F. Tan, M.-Y. Kan, and D. Lee. Search engine driven author
disambiguation. In Proc. of JCDL’06, pages 314–315, 2006.
[28] J. Tang, D. Zhang, and L. Yao. Social network extraction of academic
researchers. In Proc. of ICDM’07, pages 292–301, 2007.
[29] X. Wei and W. B. Croft. Lda-based document models for ad-hoc
retrieval. In Proc. of SIGIR’06, pages 178–185, 2006.
[30] E. Xun, C. Huang, and M. Zhou. A unified statistical model for the
identification of english basenp. In Proc. of ACL’00, 2000.
[31] X. Yin, J. Han, and P. Yu. Object distinction: Distinguishing objects
with identical names. In Proc. of ICDE’2007, pages 1242–1246,
2007.
[32] K. Yu, G. Guan, and M. Zhou. Resume information extraction with
cascaded hybrid model. In Proc. of ACL’05, pages 499–506, 2005.

你可能感兴趣的:(算法,论文研读,机器学习)

opencv-python与opencv-contrib-python的区别联系剑心缘零碎小知识 python opencv
opencv-python包含基本的opencvopencv-contrib-python是高配版，带一些收费或者专利的算法，还有一些比较新的算法的高级版本,这些算法稳定之后会加入上面那个。官网对contrib模块的简介（点击链接跳转）参考链接
通信算法之278：数据链/自组网通信设备--MIMO(2T2R)-OFDM系统系列--实际工程应用算法代码--1.系统指标需求及帧结构设计秋风战士无线通信基带处理算法 MATLAB仿真软件无线电算法无人机经验分享
MIMO(2T2R)-OFDM系统系列–实际工程应用算法代码第一章：系统指标需求拆解分析第二章：通信系统帧结构设计和OFDM参数设计第三章：通信业务速率设计及理论解调门限第四章：同步序列设计及同步性能仿真验证第五章：数据业务设计及性能仿真验证第六章：信道模型设计第七章：接收关键算法设计及仿真验证第八章：其它待补充本文目录MIMO(2T2R)-OFDM系统系列--实际工程应用算法代码一、实际项目：系
通信算法之287：通信技术点咨询秋风战士 MATLAB仿真软件无线电无线通信基带处理算法网络算法无人机经验分享
专业技术咨询方向第一：SFBC编码与解码原理推导第二：SFBC系统中信道均衡推导第三：云哨物理层协议-速率匹配-解调门限-5dB第四：两天线SCFDE系统（SFBC码）帧结构设计第五：两天线OFDM系统（SFBC码）帧结构设计第一：SFBC编码与解码原理推导第二：SFBC系统中信道均衡推导第三：云哨物理层协议-速率匹配-解调门限-5dB第四：两天线SCFDE系统（SFBC码）帧结构设计第五：两天线
【计算机毕业设计】基于Springboot的办公用品管理系统+LW 枫叶学长(专业接毕设) Java毕业设计实战案例课程设计 spring boot 后端
博主介绍：✌全网粉丝3W+,csdn特邀作者、CSDN新星计划导师、Java领域优质创作者,掘金/华为云/阿里云/InfoQ等平台优质作者、专注于Java技术领域和学生毕业项目实战,高校老师/讲师/同行前辈交流✌技术范围：SpringBoot、Vue、SSM、HLMT、Jsp、PHP、Nodejs、Python、爬虫、数据可视化、小程序、安卓app、大数据、物联网、机器学习等设计与开发。主要内容：
知识图谱的个性化智能教学推荐系统(论文+源码) 毕设工作室_wlzytw python论文项目知识图谱人工智能
目录摘要Abstract目录第1章绪论1.1研究背景及意义1.2国内外研究现状1.2.1知识图谱1.2.2个性化推荐系统1.3本文研究内容及创新点1.4全文组织结构第2章相关理论与技术概述2.1知识图谱2.1.1知识图谱的介绍与发展2.1.2知识图谱的构建2.3协同过滤推荐算法2.2.1推荐算法概述2.2.2Pearson相关系数2.2.3Spearman相关系数2.4Bert模型和Albert模
反向传播神经网络极简入门自信哥
单个神经元神经网络是多个“神经元”（感知机）的带权级联，神经网络算法可以提供非线性的复杂模型，它有两个参数：权值矩阵{Wl}和偏置向量{bl}，不同于感知机的单一向量形式，{Wl}是复数个矩阵，{bl}是复数个向量，其中的元素分别属于单个层，而每个层的组成单元，就是神经元。神经元神经网络是由多个“神经元”（感知机）组成的，每个神经元图示如下：这其实就是一个单层感知机，其输入是由和+1组成的向量，其
【TVM 教程】如何处理 TVM 报错
ApacheTVM是一个深度的深度学习编译框架，适用于CPU、GPU和各种机器学习加速芯片。更多TVM中文文档可访问→https://tvm.hyper.ai/运行TVM时，可能会遇到如下报错：---------------------------------------------------------------AnerroroccurredduringtheexecutionofTVM.F
【限时干货】Calibre智能分类，轻松突破内网限制畅享电子书库比头发还脆弱服务器 tcp/ip linux
文章目录前言1.网络书库软件下载安装2.网络书库服务器设置3.内网穿透工具设置4.公网使用kindle访问内网私人书库前言本研究旨在构建一套运行于微软操作系统环境下的独立电子图书管理体系，核心目标是建立可远程操作的资源访问机制。该架构采用高可用性设计，在第三方阅读平台服务中断时仍能保障数字内容传输的稳定性。系统创新性地融合了两大核心技术组件：通过Calibre开源软件实现文献分类算法与格式转换功能
【PaddleOCR】OCR文本检测与文本识别数据集整理，持续更新......
博主简介：曾任某智慧城市类企业算法总监，目前在美国市场的物流公司从事高级算法工程师一职，深耕人工智能领域，精通python数据挖掘、可视化、机器学习等，发表过AI相关的专利并多次在AI类比赛中获奖。CSDN人工智能领域的优质创作者，提供AI相关的技术咨询、项目开发和个性化解决方案等服务，如有需要请站内私信或者联系任意文章底部的的VX名片（ID：xf982831907）博主粉丝群介绍：①群内初中生、
说话人识别python_基于各种分类算法的说话人识别（年龄段识别） weixin_39673184 说话人识别python
基于各种分类算法的语音分类(年龄段识别)概述实习期间作为帮手打杂进行了一段时间的语音识别研究，内容是基于各种分类算法的语音的年龄段识别，总结一下大致框架，基本思想是：获取语料库TIMIT提取数据特征，进行处理MFCC/i-vectorLDA/PLDA/PCA语料提取，基于分类算法进行分类SVM/SVR/GMM/GBDT...用到的工具有HTK(C,shell)/Kaldi(C++,shell)/L
深入解析C++中 std::sort背后的实现原理 —Introsort（Introspective Sort）点云SLAM C++c++算法数据结构快速排序排序算法堆排序深度优先
Introsort简介Introsort是一种混合排序算法，结合了三种经典算法的优点：算法用于特点快速排序通常情况平均时间复杂度O(nlogn)堆排序当快速排序退化（递归过深）时最坏时间复杂度O(nlogn)插入排序小规模数组时（如长度≤16）常数开销小，快Introsort运行机制排序逻辑如下：if(size2*log2(n))堆排序（HeapSort）else快速排序（QuickSort）快速
冒泡排序算法详解（含Python代码实现）算法_小学生算法
冒泡排序（BubbleSort）是最基础的排序算法之一，通常用于学习排序算法的入门理解。本文将通过Python代码实现冒泡排序，并详细讲解其原理、执行流程、复杂度分析及适用情况。✨一、算法简介冒泡排序的核心思想是：相邻两个元素比较，将较大的元素不断“冒泡”至右侧，最终实现排序。其基本过程是重复比较相邻的元素，如果顺序错误就交换，重复这一过程，直到没有任何需要交换的为止。二、Python代码实现下面
揭秘 Spring Cloud Zuul 在后端的负载均衡策略大厂资深架构师 Spring Boot 开发实战 spring cloud 负载均衡 spring ai
揭秘SpringCloudZuul在后端的负载均衡策略关键词：SpringCloudZuul、负载均衡、微服务网关、Ribbon、请求路由摘要：在微服务架构中，API网关是流量的“总调度员”，而负载均衡则是它的“智能大脑”。本文将以“小区门卫派件”为故事主线，用通俗易懂的语言揭秘SpringCloudZuul如何通过集成Ribbon实现后端负载均衡。我们将从核心概念到算法原理，从代码实战到应用场景
ImportError: /nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4 爱编程的喵喵 Python基础课程 python ImportError torch nvJitLink 解决方案
大家好，我是爱编程的喵喵。双985硕士毕业，现担任全栈工程师一职，热衷于将数据思维应用到工作与生活中。从事机器学习以及相关的前后端开发工作。曾在阿里云、科大讯飞、CCF等比赛获得多次Top名次。现为CSDN博客专家、人工智能领域优质创作者。喜欢通过博客创作的方式对所学的知识进行总结与归纳，不仅形成深入且独到的理解，而且能够帮助新手快速入门。本文主要介绍了ImportError:/home/
【NWFSP问题】基于中华穿山甲算法CPO求解零等待流水车间调度问题NWFSP研究（Matlab代码实现）
欢迎来到本博客❤️❤️博主优势：博客内容尽量做到思维缜密，逻辑清晰，为了方便读者。⛳️座右铭：行百里者，半于九十。本文目录如下：目录⛳️赠与读者1概述1.引言2.理论基础2.1中华穿山甲算法（CPO）核心原理2.2NWFSP数学模型3.CPO-NWFSP求解框架设计3.1编码与解码3.2离散化位置更新3.3目标函数适配4.实验设计与性能分析4.1实验设置4.2结果分析4.3敏感性分析5.结论与展望
【机器学习笔记 Ⅱ】11 决策树模型巴伦是只猫机器学习机器学习笔记决策树
决策树模型（DecisionTree）详解决策树是一种树形结构的监督学习模型，通过一系列规则对数据进行分类或回归。其核心思想是模仿人类决策过程，通过不断提问（基于特征划分）逐步逼近答案。1.核心概念节点类型：根节点：起始问题（最佳特征划分点）。内部节点：中间决策步骤（特征判断）。叶节点：最终预测结果（类别或数值）。分支：对应特征的取值或条件判断（如“年龄≥30？”）。2.构建决策树的关键步骤(1)
【机器学习笔记 Ⅱ】10 完整周期
机器学习的完整生命周期（End-to-EndPipeline）机器学习的完整周期涵盖从问题定义到模型部署的全过程，以下是系统化的步骤分解和关键要点：1.问题定义（ProblemDefinition）目标：明确业务需求与机器学习任务的匹配性。关键问题：这是分类、回归、聚类还是强化学习问题？成功的标准是什么？（如准确率>90%、降低10%成本）输出：项目目标文档（含评估指标）。2.数据收集（DataC
【机器学习笔记Ⅰ】13 正则化代价函数
正则化代价函数（RegularizedCostFunction）详解正则化代价函数是机器学习中用于防止模型过拟合的核心技术，通过在原始代价函数中添加惩罚项，约束模型参数的大小，从而提高泛化能力。以下是系统化的解析：1.为什么需要正则化？过拟合问题：当模型过于复杂（如高阶多项式回归、深度神经网络）时，可能完美拟合训练数据但泛化性能差。解决方案：在代价函数中增加对参数的惩罚，抑制不重要的特征权重。2.
【机器学习笔记Ⅰ】6 多类特征巴伦是只猫机器学习机器学习笔记人工智能
多类特征（Multi-classFeatures）详解多类特征是指一个特征（变量）可以取多个离散的类别值，且这些类别之间没有内在的顺序关系。这类特征是机器学习中常见的数据类型，尤其在分类和回归问题中需要特殊处理。1.核心概念(1)什么是多类特征？定义：特征是离散的、有限的类别，且类别之间无大小或顺序关系。示例：颜色：红、绿、蓝（无顺序）。城市：北京、上海、广州（无数学意义的大小关系）。动物类别：猫
图像分割技术详解：从原理到实践 lanjieying
本文还有配套的精品资源，点击获取简介：图像分割是图像处理领域将图像分解为多个区域的过程，用于图像分析、特征提取等。文章介绍了图像分割的原理，并通过一个将图像划分为2*4子块的示例，展示了如何使用Python和matplotlib库中的tight_subplot函数进行图像分割和展示。文章还探讨了图像分割在不同领域的应用，以及如何在机器学习项目中作为数据预处理步骤。1.图像分割基本概念在图像处理领域
机器学习笔记——支持向量机 star_and_sun 机器学习笔记支持向量机
支持向量机参数模型对分布需要假设（这也是与非参数模型的区别之一）间隔最大化，形式转化为凸二次规划问题最大化间隔间隔最大化是意思：对训练集有着充分大的确信度来分类训练数据，最难以分的点也有足够大的信度将其分开间隔最大化的分离超平面的的求解怎么求呢？最终的方法如下1.线性可分的支持向量机的优化目标其实就是找得到分离的的超平面求得参数w和b的值就可以了注意，最大间隔分离超平面是唯一的，间隔叫硬间隔1.1
【机器学习&深度学习】多分类评估策略一叶千舟深度学习【理论】深度学习【应用必备常识】大数据人工智能
目录前言一、多分类3大策略✅宏平均（MacroAverage）✅加权平均（WeightedAverage）✅微平均（MicroAverage）二、类比理解2.1宏平均（MacroAverage）2.1.1计算方式2.1.2适合场景2.1.3宏平均不适用的场景2.1.4宏平均一般用在哪些指标上？2.1.5怎么看macroavg指标？2.1.6宏平均值低说明了什么？2.1.7从宏平均指标中定位模型短板
LRU Cache Mr_Xuhhh c++c语言算法开发语言 python
LRUCache定义缓存算法（LeastRecentlyUsed)核心思想最近最少使用或最久未使用。当缓存空间不足时，它会优先淘汰最长时间没有访问的数据项类比：图书馆的书架管理，经常被借阅的书放在最前面方便取用，而长期无人问津的书会被移到后面或下架数据结构选择与设计1）双向链表1.用于维护元素的访问顺序，最近访问的元素放在链表头部，最久未被访问的放在尾部2.支持O（1）时间复杂度的任意位置插入和删
【机器学习笔记Ⅰ】7 向量化巴伦是只猫机器学习机器学习笔记人工智能
向量化（Vectorization）详解向量化是将数据或操作转换为向量（或矩阵）形式，并利用并行计算高效处理的技术。它是机器学习和数值计算中的核心优化手段，能显著提升代码运行效率（尤其在Python中避免显式循环）。1.为什么需要向量化？(1)传统循环的缺陷低效：Python的for循环逐元素操作，速度慢。代码冗长：需手动处理每个元素。示例：计算两个数组的点积（非向量化）a=[1,2,3]b=[4
【Python】simulink与python联合仿真
1.1Simulink的边界：事件驱动、算法复杂性与AI集成瓶颈Simulink的核心优势在于其强大的微分方程求解器和对连续时间系统、离散时间系统的精确描述能力。其基于“信号流”和“框图”的建模范式，使得工程师可以直观地构建与物理现实高度对应的数学模型。然而，这种优势也带来了其天然的局限性：基于时间的驱动核心(Time-BasedCoreEngine):Simulink的“心脏”是一个时间驱动的仿
【PyTorch】教程：torch.nn.GELU 老周有AI~算法定制 PyTorch pytorch 深度学习 python
torch.nn.GELU原型CLASStorch.nn.GELU(approximate='none')参数approximate(str,optional)–gelu近似算法用none或者tanh，默认为none;定义高斯误差线性单元函数GELU(x)=x∗ϕ(x)\text{GELU}(x)=x*\phi(x)GELU(x)=x∗ϕ(x)其中ϕ(x)\phi(x)ϕ(x)为高斯分布的累积分布
数据结构之栈实验 lannnn_ 学习记录数据结构 c语言栈
栈实验实验目的实验环境实验要求实验内容源代码运行结果实验目的掌握栈这种数据结构特性及其主要存储结构，并能在现实生活中灵活运用。实验环境CodeBlocks实验要求1.熟悉c语言的语法知识；2.掌握栈的顺序存储结构—顺序栈的定义、构造、获得栈顶元素、入栈、出栈等基本操作；实验内容完成栈的定义、构造、获得栈顶元素、进栈、出栈等函数的编写。要求在主函数中实现对以上操作的调用，编写一个算法判断给定的字符向
新手必看：入行大模型前一定要知道的几件事！和老莫一起学AI 人工智能 java 机器学习大模型算法程序员转行
大模型怎么转？适合哪些人？哪些方向对新手友好？又有哪些坑你必须避开？文章有点长，但全是我这几年观察下来最真实的经验，如果你真的想搞懂大模型、入场不踩坑，建议认真读完，或先收藏慢慢看。一、大模型≠ChatGPT，先搞清“全景图”再出发说句真话，很多人对“大模型”的第一印象就是——ChatGPT。但这只是它的"最上层"，底下的基建、平台、算法、数据处理、推理部署……才是撑起整个技术栈的骨架。入行大模型
php字符串匹配算法,字符串查找算法及原理
面试题:判断字符串是否在另一个字符串中存在？面试时发现好多人回答不好,所以就梳理了一下已知的方法,此文较长,需要耐心的看下去。从实现和算法原理两方面解此问题，其中有用PHP原生方法实现也有一些业界大牛创造的算法。实现方法一:语言特性-内置函数/*strpos示例*///testecho'match:',strpos('xasfsdfbk','xasfsdfbk')!==false?'true':'
李宏毅2025《机器学习》第四讲-Transformer架构的演进
Transformer架构的演进与替代方案：从RNN到Mamba的技术思辨Transformer作为当前AI领域的标准架构，其设计并非凭空而来，也并非没有缺点。本次讨论的核心便是：新兴的架构，如MAMA，是如何针对Transformer的弱点进行改进，并试图提供一个更优的解决方案的。要理解架构的演进，我们必须首先明确一个核心原则：每一种神经网络架构，都有其存在的技术理由。CNN（卷积神经网络）：为
深入浅出Java Annotation(元注解和自定义注解） Josh_Persistence Java Annotation 元注解自定义注解
一、基本概述　　 Annontation是Java5开始引入的新特征。中文名称一般叫注解。它提供了一种安全的类似注释的机制，用来将任何的信息或元数据（metadata）与程序元素（类、方法、成员变量等）进行关联。　　更通俗的意思是为程序的元素（类、方法、成员变量）加上更直观更明了的说明，这些说明信息是与程序的业务逻辑无关，并且是供指定的工具或
mysql优化特定类型的查询 annan211 java 工作 mysql
本节所介绍的查询优化的技巧都是和特定版本相关的，所以对于未来mysql的版本未必适用。 1 优化count查询对于count这个函数的网上的大部分资料都是错误的或者是理解的都是一知半解的。在做优化之前我们先来看看真正的count()函数的作用到底是什么。 count()是一个特殊的函数，有两种非常不同的作用，他可以统计某个列值的数量，也可以统计行数。在统
MAC下安装多版本JDK和切换几种方式棋子chessman jdk
环境： MAC AIR,OS X 10.10,64位历史：过去 Mac 上的 Java 都是由 Apple 自己提供，只支持到 Java 6，并且OS X 10.7 开始系统并不自带（而是可选安装）（原自带的是1.6）。后来 Apple 加入 OpenJDK 继续支持 Java 6，而 Java 7 将由 Oracle 负责提供。在终端中输入jav
javaScript （1） Array_06 JavaScript java 浏览器
JavaScript 1、运算符　　运算符就是完成操作的一系列符号，它有七类：　　赋值运算符（=,+=,-=,*=,/=,%=,<<=,>>=,|=,&=）、算术运算符(+,-,*,/,++,--,%)、比较运算符(>,<,<=,>=,==,===,!=,!==)、逻辑运算符(||,&&,!)、条件运算(?:)、位
国内顶级代码分享网站袁潇含 java jdk oracle .net PHP
现在国内很多开源网站感觉都是为了利益而做的当然利益是肯定的,否则谁也不会免费的去做网站 &
Elasticsearch、MongoDB和Hadoop比较随意而生 mongodb hadoop 搜索引擎
IT界在过去几年中出现了一个有趣的现象。很多新的技术出现并立即拥抱了“大数据”。稍微老一点的技术也会将大数据添进自己的特性，避免落大部队太远，我们看到了不同技术之间的边际的模糊化。假如你有诸如Elasticsearch或者Solr这样的搜索引擎，它们存储着JSON文档，MongoDB存着JSON文档，或者一堆JSON文档存放在一个Hadoop集群的HDFS中。你可以使用这三种配
mac os 系统科研软件总结张亚雄 mac os
1.1 Microsoft Office for Mac 2011 大客户版，自行搜索。 1.2 Latex （MacTex）: 系统环境：https://tug.org/mactex/ &nb
Maven实战（四）生命周期 AdyZhang maven
1. 三套生命周期 Maven拥有三套相互独立的生命周期，它们分别为clean，default和site。每个生命周期包含一些阶段，这些阶段是有顺序的，并且后面的阶段依赖于前面的阶段，用户和Maven最直接的交互方式就是调用这些生命周期阶段。以clean生命周期为例，它包含的阶段有pre-clean, clean 和 post
Linux下Jenkins迁移 aijuans Jenkins
1. 将Jenkins程序目录copy过去源程序在/export/data/tomcatRoot/ofctest-jenkins.jd.com下面 tar -cvzf jenkins.tar.gz ofctest-jenkins.jd.com &
request.getInputStream()只能获取一次的问题 ayaoxinchao request Inputstream
问题：在使用HTTP协议实现应用间接口通信时，服务端读取客户端请求过来的数据，会用到request.getInputStream()，第一次读取的时候可以读取到数据，但是接下来的读取操作都读取不到数据原因： 1. 一个InputStream对象在被读取完成后，将无法被再次读取，始终返回-1； 2. InputStream并没有实现reset方法（可以重
数据库SQL优化大总结之百万级数据库优化方案 BigBird2012 SQL优化
网上关于SQL优化的教程很多，但是比较杂乱。近日有空整理了一下，写出来跟大家分享一下，其中有错误和不足的地方，还请大家纠正补充。这篇文章我花费了大量的时间查找资料、修改、排版，希望大家阅读之后，感觉好的话推荐给更多的人，让更多的人看到、纠正以及补充。 1.对查询进行优化，要尽量避免全表扫描，首先应考虑在 where 及 order by 涉及的列上建立索引。 2.应尽量避免在 where
jsonObject的使用 bijian1013 java json
在项目中难免会用java处理json格式的数据，因此封装了一个JSONUtil工具类。 JSONUtil.java package com.bijian.json.study; import java.util.ArrayList; import java.util.Date; import java.util.HashMap;
[Zookeeper学习笔记之六]Zookeeper源代码分析之Zookeeper.WatchRegistration bit1129 zookeeper
Zookeeper类是Zookeeper提供给用户访问Zookeeper service的主要API，它包含了如下几个内部类首先分析它的内部类，从WatchRegistration开始，为指定的znode path注册一个Watcher， /** * Register a watcher for a particular p
【Scala十三】Scala核心七：部分应用函数 bit1129 scala
何为部分应用函数？ Partially applied function: A function that’s used in an expression and that misses some of its arguments.For instance, if function f has type Int => Int => Int, then f and f(1) are p
Tomcat Error listenerStart 终极大法 ronin47 tomcat
Tomcat报的错太含糊了，什么错都没报出来，只提示了Error listenerStart。为了调试，我们要获得更详细的日志。可以在WEB-INF/classes目录下新建一个文件叫logging.properties，内容如下 Java代码 handlers = org.apache.juli.FileHandler, java.util.logging.ConsoleHa
不用加减符号实现加减法 BrokenDreams 实现
今天有群友发了一个问题，要求不用加减符号(包括负号)来实现加减法。分析一下，先看最简单的情况，假设1+1，按二进制算的话结果是10，可以看到从右往左的第一位变为0，第二位由于进位变为1。
读《研磨设计模式》-代码笔记-状态模式-State bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ /* 当一个对象的内在状态改变时允许改变其行为，这个对象看起来像是改变了其类状态模式主要解决的是当控制一个对象状态的条件表达式过于复杂时的情况把状态的判断逻辑转移到表示不同状态的一系列类中，可以把复杂的判断逻辑简化如果在
CUDA程序block和thread超出硬件允许值时的异常 cherishLC CUDA
调用CUDA的核函数时指定block 和 thread大小，该大小可以是dim3类型的（三维数组），只用一维时可以是usigned int型的。以下程序验证了当block或thread大小超出硬件允许值时会产生异常！！！GPU根本不会执行运算！！！所以验证结果的正确性很重要！！！在VS中创建CUDA项目会有一个模板，里面有更详细的状态验证。以下程序在K5000GPU上跑的。
诡异的超长时间GC问题定位 chenchao051 jvm cms GC hbase swap
HBase的GC策略采用PawNew+CMS, 这是大众化的配置，ParNew经常会出现停顿时间特别长的情况，有时候甚至长到令人发指的地步，例如请看如下日志： 2012-10-17T05:54:54.293+0800: 739594.224: [GC 739606.508: [ParNew: 996800K->110720K(996800K), 178.8826900 secs] 3700
maven环境快速搭建 daizj 安装 mavne 环境配置
一下载maven 安装maven之前，要先安装jdk及配置JAVA_HOME环境变量。这个安装和配置java环境不用多说。 maven下载地址：http://maven.apache.org/download.html，目前最新的是这个apache-maven-3.2.5-bin.zip，然后解压在任意位置，最好地址中不要带中文字符，这个做java 的都知道，地址中出现中文会出现很多
PHP网站安全，避免PHP网站受到攻击的方法 dcj3sjt126com PHP
对于PHP网站安全主要存在这样几种攻击方式:1、命令注入(Command Injection)2、eval注入(Eval Injection)3、客户端脚本攻击(Script Insertion)4、跨网站脚本攻击(Cross Site Scripting, XSS)5、SQL注入攻击(SQL injection)6、跨网站请求伪造攻击(Cross Site Request Forgerie
yii中给CGridView设置默认的排序根据时间倒序的方法 dcj3sjt126com GridView
public function searchWithRelated() { $criteria = new CDbCriteria; $criteria->together = true; //without th
Java集合对象和数组对象的转换 dyy_gusi java集合
在开发中，我们经常需要将集合对象（List，Set）转换为数组对象，或者将数组对象转换为集合对象。Java提供了相互转换的工具，但是我们使用的时候需要注意，不能乱用滥用。 1、数组对象转换为集合对象最暴力的方式是new一个集合对象，然后遍历数组，依次将数组中的元素放入到新的集合中，但是这样做显然过
nginx同一主机部署多个应用 geeksun nginx
近日有一需求，需要在一台主机上用nginx部署2个php应用，分别是wordpress和wiki，探索了半天，终于部署好了，下面把过程记录下来。 1. 在nginx下创建vhosts目录，用以放置vhost文件。 mkdir vhosts 2. 修改nginx.conf的配置，在http节点增加下面内容设置，用来包含vhosts里的配置文件 #
ubuntu添加admin权限的用户账号 hongtoushizi ubuntu useradd
ubuntu创建账号的方式通常用到两种：useradd 和adduser . 本人尝试了useradd方法，步骤如下： 1:useradd 使用useradd时，如果后面不加任何参数的话，如：sudo useradd sysadm 创建出来的用户将是默认的三无用户：无home directory ,无密码,无系统shell。顾应该如下操作：
第五章常用Lua开发库2-JSON库、编码转换、字符串处理 jinnianshilongnian nginx lua
JSON库在进行数据传输时JSON格式目前应用广泛，因此从Lua对象与JSON字符串之间相互转换是一个非常常见的功能；目前Lua也有几个JSON库，本人用过cjson、dkjson。其中cjson的语法严格（比如unicode \u0020\u7eaf），要求符合规范否则会解析失败（如\u002），而dkjson相对宽松，当然也可以通过修改cjson的源码来完成
Spring定时器配置的两种实现方式OpenSymphony Quartz和java Timer详解 yaerfeng1989 timer quartz 定时器
原创整理不易，转载请注明出处：Spring定时器配置的两种实现方式OpenSymphony Quartz和java Timer详解代码下载地址：http://www.zuidaima.com/share/1772648445103104.htm 有两种流行Spring定时器配置：Java的Timer类和OpenSymphony的Quartz。 1.Java Timer定时首先继承jav
Linux下df与du两个命令的差别？ pda158 linux
　一、df显示文件系统的使用情况，与du比較，就是更全盘化。　　最经常使用的就是 df -T，显示文件系统的使用情况并显示文件系统的类型。　　举比例如以下：　　[root@localhost ~]# df -T 　　Filesystem Type &n
[转]SQLite的工具类 ---- 通过反射把Cursor封装到VO对象 ctfzh VO android sqlite 反射 Cursor
在写DAO层时，觉得从Cursor里一个一个的取出字段值再装到VO(值对象)里太麻烦了，就写了一个工具类，用到了反射，可以把查询记录的值装到对应的VO里，也可以生成该VO的List。使用时需要注意：考虑到Android的性能问题，VO没有使用Setter和Getter，而是直接用public的属性。表中的字段名需要和VO的属性名一样，要是不一样就得在查询的SQL中
该学习笔记用到的Employee表 vipbooks oracle sql 工作
这是我在学习Oracle是用到的Employee表，在该笔记中用到的就是这张表，大家可以用它来学习和练习。 drop table Employee; -- 员工信息表 create table Employee( -- 员工编号 EmpNo number(3) primary key, -- 姓