来源:《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT。
The essential characteristics of a social network are:
Social networks are naturally modeled as graphs, which we sometimes refer to as a social graph. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges. Often, social graphs are undirected, as for the Facebook friends graph. But they can be directed graphs, as for example the graphs of followers on Twitter or Google+.
There are many examples of social networks other than “friends” networks. Here, let us enumerate some of the other examples of networks that also exhibit locality of relationships.
Telephone Networks
Here the nodes represent phone numbers, which are really individuals. There is an edge between two nodes if a call has been placed between those phones in some fixed period of time, such as last month, or “ever.” The edges could be weighted by the number of calls made between these phones during the period. Communities in a telephone network will form from groups of people that communicate frequently: groups of friends, members of a club, or people working at the same company, for example.
Email Networks
The nodes represent email addresses, which are again individuals. An edge represents the fact that there was at least one email in at least one direction between the two addresses. Alternatively, we may only place an edge if there were emails in both directions. In that way, we avoid viewing spammers as “friends” with all their victims. Another approach is to label edges as weak or strong. Strong edges represent communication in both directions, while weak edges indicate that the communication was in one direction only. The communities seen in email networks come from the same sorts of groupings we mentioned in connection with telephone networks. A similar sort of network involves people who text other people through their cell phones.
Collaboration Networks
Nodes represent individuals who have published research papers. There is an edge between two individuals who published one or more papers jointly. Optionally, we can label edges by the number of joint publications. The communities in this network are authors working on a particular topic.
An alternative view of the same data is as a graph in which the nodes are papers. Two papers are connected by an edge if they have at least one author in common. Now, we form communities that are collections of papers on the same topic.
There are several other kinds of data that form two networks in a similar way. For example, we can look at the people who edit Wikipedia articles and the articles that they edit. Two editors are connected if they have edited an article in common. The communities are groups of editors that are interested in the same subject. Dually, we can build a network of articles, and connect
articles if they have been edited by the same person. Here, we get communities of articles on similar or related subjects.
In fact, the data involved in Collaborative filtering, as was discussed in Chapter 9, often can be viewed as forming a pair of networks, one for the customers and one for the products. Customers who buy the same sorts of products, e.g., science-fiction books, will form communities, and dually, products that are bought by the same customers will form communities, e.g., all science-fiction books.
Other Examples of Social Graphs
Many other phenomena give rise to graphs that look something like social graphs, especially exhibiting locality. Examples include: information networks (documents, web graphs, patents), infrastructure networks (roads, planes, water pipes, powergrids), biological networks (genes, proteins, food-webs of animals eating each other), as well as other types, like product co-purchasing networks (e.g., Groupon).
Figure 10.2 is an example of a tripartite graph (the case k = 3 of a k-partite graph). There are three sets of nodes, which we may think of as users { U 1 , U 2 } \{U_1, U_2\} {U1,U2}, tags { T 1 , T 2 , T 3 , T 4 } \{T_1, T_2, T_3, T_4\} {T1,T2,T3,T4}, and Web pages { W 1 , W 2 , W 3 } \{W_1, W_2, W_3\} {W1,W2,W3}.
Define the betweenness of an edge ( a , b ) (a, b) (a,b) to be the number of pairs of nodes x x x and y y y such that the edge ( a , b ) (a, b) (a,b) lies on the shortest path between x x x and y y y. To be more precise, since there can be several shortest paths between x x x and y y y, edge ( a , b ) (a, b) (a,b) is credited with the fraction of those shortest paths that include the edge ( a , b ) (a, b) (a,b). As in golf, a high score is bad. It suggests that the edge ( a , b ) (a, b) (a,b) runs between two different communities; that is, a a a and b b b do not belong to the same community.
Following is Girvan-Newman Algorithm
Can see this blog: Girvan-Newman Algorithm
A proper definition of a “good” cut must balance the size of the cut itself against the difference in the sizes of the sets that the cut creates. One choice that serves well is the “normalized cut.” First, define the volume of a set S of nodes, denoted V o l ( S ) Vol(S) Vol(S), to be the number of edges with at least one end in S. Suppose we partition the nodes of a graph into two disjoint sets S and T .
Let C u t ( S , T ) Cut(S,T) Cut(S,T) be the number of edges that connect a node in S to a node in T . Then the normalized cut value for S and T is
C u t ( S , T ) V o l ( S ) + C u t ( S , T ) V o l ( T ) \frac{Cut(S,T)}{Vol(S)}+\frac{Cut(S,T)}{Vol(T)} Vol(S)Cut(S,T)+Vol(T)Cut(S,T)
Suppose our graph has adjacency matrix A and degree matrix D. Our third matrix, called the Laplacian matrix, is L = D − A L = D − A L=D−A, the difference between the degree matrix and the adjacency matrix. That is, the Laplacian matrix L has the same entries as D on the diagonal. Off the diagonal, at row i and column j, L has −1 if there is an edge between nodes i and j and 0 if not.
Eigenvalues of the Laplacian Matrix
The smallest eigenvalue for every Laplacian matrix is 0, and its corresponding eigenvector is [1, 1, . . . , 1].
There is a simple way to find the second-smallest eigenvalue for any matrix, such as the Laplacian matrix, that is symmetric (the entry in row i i i and column
j j j equals the entry in row j j j and column i i i). While we shall not prove this fact, the second-smallest eigenvalue of L is the minimum of x T L x x^TLx xTLx, where x = [ x 1 , x 2 , . . . , x n ] x = [x_1, x_2, . . . , x_n] x=[x1,x2,...,xn] is a column vector with n components, and the minimum is
taken under the constraints:
When L is a Laplacian matrix for an n-node graph, we know something more. The eigenvector associated with the smallest eigenvalue is 1. Thus, if x is orthogonal to 1, we must have
x T 1 = ∑ i = 1 n x i = 0 \bold{x}^T\bold{1}=\sum_{i=1}^nx_i=0 xT1=i=1∑nxi=0
In addition for the Laplacian matrix, the expression x T L x x^TLx xTLx has a useful equivalent expression. Recall that L = D − A L = D − A L=D−A, where D and A are the degree and adjacency matrices of the same graph. Thus, x T L x = x T D x − x T A x x^TLx = x^TDx − x^TAx xTLx=xTDx−xTAx.
The Affiliation-Graph Model
The elements of the graph-generation mechanism are:
There is a solution to the problem caused by the affiliation-graph model, where membership of individuals in communities is discrete: either you are a member of the community or not. Instead of “all-or-nothing” membership of nodes in communities, we can suppose that for each node and each community, there is a “strength of membership” for that node and community. Intuitively, the stronger the membership of two individuals in the same community is, the more likely it is that this community will cause there to be an edge between them.
In this model, we can adjust the strength of membership for an individual in a community continuously, just as we can adjust the probability associated with a community in the affiliation-graph model. That improvement allows us to use methods for optimizing continuous functions, such as gradient descent, to maximize the expression for likelihood. In the new model, we have
Fixed sets of communities and individuals, as before.
For each community C and individual x, there is a strength of membership
parameter F x C F_{xC} FxC . These parameters can take any nonnegative value, and a value of 0 means the individual is definitely not in the community.
The probability that community C causes there to be an edge between nodes u and v is
p C ( u , v ) = 1 − e − F u C F v C p_C(u,v)=1-e^{-F_{uC}F_{vC}} pC(u,v)=1−e−FuCFvC
As before, the probability of there being an edge between u and v is 1 minus the probability that none of the communities causes there to be an edge between them. That is, each community independently causes edges, and an edge exists between two nodes if any community causes it to exist. More formally, p u v p_{uv} puv, the probability of an edge between nodes u and v, is
p u v = 1 − ∏ C ( 1 − p C ( u , v ) ) p_{uv}=1-\prod_C(1-p_C(u,v)) puv=1−C∏(1−pC(u,v))
Finally, let E be the set of edges in the observed graph. As before, we can write the formula for the likelihood of the observed graph as the product of the expression for p u v p_{uv} puv for each edge ( u , v ) (u, v) (u,v) that is in E, times the product of 1 − p u v 1−p_{uv} 1−puv for each edge ( u , v ) (u, v) (u,v) that is not in E. Thus, in the new model, the formula for the likelihood of the graph with edges E is
∏ ( u , v ) i n E ( 1 − e − F u C F v C ) ∏ ( u , v ) n o t i n E ( e − F u C F v C ) \prod_{(u,v)\space in \space E}(1-e^{-F_{uC}F_{vC}})\prod_{(u,v)\space not\space in \space E}(e^{-F_{uC}F_{vC}}) (u,v) in E∏(1−e−FuCFvC)(u,v) not in E∏(e−FuCFvC)
Continuous Versus Discrete Optimization
Whenever you are trying to find an optimum solution to a problem where some of the elements are continuous (e.g., the probabilities of communities inducing edges) and some elements are discrete (e.g., the memberships of the communities), it might make sense to replace each discrete element by a continuous variable. That change, while it doesn’t strictly speaking represent reality, enables us to perform only one optimization, which will use a continuous optimization method such as gradient descent. We run the risk of winding up in a local optimum that is not the globally best solution.
In this section, we shall take up another approach to analyzing social-network graphs. This technique, called “simrank,” applies best to graphs with several types of nodes, although it can in principle be applied to any graph. The purpose of simrank is to measure the similarity between nodes of the same type, and it does so by seeing where random walkers on the graph wind up when starting at a particular node, the source node. Because calculation must be carried out once for each source node, there is a limit to the size of graphs that can be analyzed completely in this manner. However, we shall also offer an algorithm that approximates simrank, but is far more efficient than iterated matrix-vector multiplication. Finally, we show how simrank can be used to find communities.
Let us use β as the probability that the walker continues at random, so 1 − β is the probability the walker will teleport to the source node S. Let e S e_S eS be the column vector that has 1 in the row for node S and 0’s elsewhere. Then if v is the column vector that reflects the probability the walker is at each of the nodes at a particular round, and v ′ v′ v′ is the probability the walker is at each of the nodes at the next round, then v ′ v′ v′ is related to v by:
v ′ = β M v + ( 1 − β ) e S v'=\beta Mv+(1-\beta)e_S v′=βMv+(1−β)eS
A simple approach is to add nodes until the density of edges – the fraction of edges that connect nodes in the community divided by the number of possible edges connecting nodes in the community (i.e., C 2 n C_2^n C2n for an n-node community) – goes below a threshold.
Another measure of the goodness of a community C is the conductance. To define conductance, we need first to define the volume of a community C, which is the smaller of the sum of the degrees of the nodes in C and the sum of the degrees of the nodes not in C. Then the conductance of C is the ratio of the number of edges with exactly one end in C divided by the volume of C. The intuitive idea behind conductance is that it looks for sets C that are neither too small nor too large, and yet can be disconnected from the network by cutting a relatively small number of edges. Small conductance suggests a closely knit community.
The neighborhood of radius d for a node v is the set of nodes u for which there is a path of length at most d from v to u. We denote this neighborhood by N ( v , d ) N(v, d) N(v,d). For example, N(v, 0) is always {v}, and N(v, 1) is v plus the set of nodes to which there is an arc from v. More generally, if V is a set of nodes, then N(V, d) is the set of nodes u for which there is a path of length d or less from at least one node in the set V .
The neighborhood profile of a node v is the sequence of sizes of its neighborhoods ∣ N ( v , 1 ) ∣ , ∣ N ( v , 2 ) ∣ , . . . . |N(v, 1)|, |N(v, 2)|, . . . . ∣N(v,1)∣,∣N(v,2)∣,.... We do not include the neighborhood of distance 0, since its size is always 1.