The PageRank Problem
Hans Vandierendonck
CSC3021 Concurrent Programming, 2018–'19

PageRank is Google's algorithm to rank the search results that match the queried keywords [1]. The algorithm models the internet as a directed graph where web pages are represented as vertices and links between pages are edges. The PageRank algorithm calculates the likelihood, or rank, that a page will be visited by people surfing the web. The rank of a page depends on the rank of the pages that link to it, where pages that are frequently pointed to tend to gain a higher rank. Also, pages pointed to by highly ranked pages tend to receive a higher rank. The algorithm is itself quite simple but in practice it captures well the appreciation of the importance of pages by humans.

You will create a Java program that solves the PageRank problem in Assignment 1, and you will create a concurrent version of that code in Assignment 2. This document provides essential background on the PageRank problem. You are encouraged to initiate your own reading on this subject.

1 Graphs

Let's talk first about graphs. A graph is a collection of items that are connected. We call the items vertices and the connections are called edges or links. An edge connects two vertices. We assume edges are directed, i.e., they point from a source vertex (also called a tail) to a destination vertex (also called a head).

In mathematical terms, a graph is described as a pair G = (V, E) where V is the set of vertices and E is the set of edges. We describe an edge as an ordered pair of vertices, i.e., E ⊆ V × V. The number of vertices is given by |V| and the number of edges is given by |E|.

An edge is described as an ordered pair. As such, the pair (u, v) describes an edge pointing from u to v, assuming u and v are elements of the set V. We say that u is the source of the edge and v is the destination. We also say that the edge is incident to u and to v.

The number of edges incident to a vertex is called the degree of the vertex. A vertex with degree 5 will thus appear in the description of 5 edges, either as source or destination. We further distinguish the in-degree and the out-degree. The in-degree of vertex v is the number of edges pointing to v. It is indicated by deg−(v). The out-degree is the number of edges pointing away from v. It is indicated by deg+(v). There is a relationship between the different types of degrees. In the absence of self-edges (edges pointing from a vertex to itself), it is always the case that deg−(v) + deg+(v) = deg(v).

A graph may be directed or undirected. In a directed graph, each edge is an ordered pair, pointing from a source to a destination vertex. Our definition of a graph focuses on directed graphs. In an undirected graph, edges have no direction. They link two vertices without notion of source and destination. We emulate undirected graphs by assuming that the connection between two vertices u and v is recorded by a directed edge from u to v and another directed edge pointing from v to u. As such, in our representation, the number of edges in an undirected graph is always an even number.
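To make these definitions concrete, the following minimal Java sketch stores a small directed graph as (source, destination) pairs and tallies the in-degree and out-degree of every vertex. The example edges and all names are illustrative only; they are not part of the assignment.

    // Minimal sketch: a small directed graph stored as (source, destination)
    // pairs, with a tally of in- and out-degrees. Illustrative only.
    public class DegreeExample {
        public static void main(String[] args) {
            int n = 4;
            int[][] edges = { {0, 1}, {0, 2}, {1, 2}, {2, 3}, {3, 0} };
            int[] inDegree = new int[n];   // deg-(v)
            int[] outDegree = new int[n];  // deg+(v)
            for (int[] e : edges) {
                outDegree[e[0]]++;  // e[0] is the source (tail) of the edge
                inDegree[e[1]]++;   // e[1] is the destination (head) of the edge
            }
            for (int v = 0; v < n; ++v) {
                // In the absence of self-edges, deg-(v) + deg+(v) = deg(v).
                System.out.println("vertex " + v + ": deg-=" + inDegree[v]
                        + ", deg+=" + outDegree[v]
                        + ", deg=" + (inDegree[v] + outDegree[v]));
            }
        }
    }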
2 Problem Statement

The PageRank algorithm assumes that a person surfing the web visits a sequence of web pages. The next page in the sequence is determined either by following one of the outgoing links of the current page, or by restarting the sequence in a random web page. The PageRank value PR(d) for a page d is recursively defined by the following equation:

    PR(d) = (1 − α)/n + α · Σ_{s ∈ in(d)} PR(s) / deg+(s)    (1)

where n is the number of vertices in the web graph, in(d) is the set of web pages pointing to the page d, deg+(s) is the number of outgoing links of page s (the out-degree) and α is the damping factor. This equation has two terms. The first term states that with probability 1 − α the surfer visits a random web page. The second term models the likelihood of arriving at page d by following a sequence of links. Note that page s has deg+(s) outgoing links and that its PageRank value is equally distributed among all these links. As such, every link is equally likely to be followed.

Because of its importance, efficient ways to solve this equation have been extensively investigated. We will describe an iterative method that is easy to implement and is amenable to concurrent execution.
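As a concrete reading of Equation 1, the following minimal Java sketch evaluates its right-hand side for a single page d, given the current estimates, the set in(d) and the out-degrees. All names are illustrative assumptions; the next section describes the method actually used to solve the equation.

    // Minimal sketch: evaluate the right-hand side of Equation 1 for one page d.
    // pr holds the current PageRank estimates, in[d] lists the pages linking to d,
    // and outDegree[s] is deg+(s). All names are illustrative.
    public class Equation1 {
        public static double pageRankOf(int d, double alpha, double[] pr,
                                        int[][] in, int[] outDegree) {
            int n = pr.length;
            double sum = 0.0;
            for (int s : in[d]) {
                sum += pr[s] / outDegree[s];   // PR(s) / deg+(s)
            }
            return (1 - alpha) / n + alpha * sum;
        }
    }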
3 The Power Iteration Method

The power iteration method solves the recursive Equation 1 by feeding in estimates for PR(s) in the right-hand side of the equation and calculating new PageRank values by evaluating the formula. The newly calculated values are then fed into the right-hand side again and the process is repeated until the PageRank values converge, i.e., they change only marginally in subsequent steps.

In order to describe the power iteration method unambiguously, we introduce an extra parameter t that indicates the iteration and let PR(s; t) represent the PageRank of page s in iteration t.

Initially, the PageRank values are uniformly initialised to the same value. As they represent a probability, we make sure that they all add up to 1 and we set them to

    PR(s; 0) = 1/n    (2)

Given that the PageRank values at iteration t are known, the power method calculates the PageRank values at iteration t + 1 using the following formula:

    PR(d; t + 1) = (1 − α)/n + α · Σ_{s ∈ in(d)} PR(s; t) / deg+(s)    (3)

This formula is identical to Equation 1 with the iteration numbers added.

The power method continues for as many iterations as necessary until the L1-norm of the difference between PageRank values in subsequent iterations is less than a pre-defined error bound ε:

    Σ_p |PR(p; t + 1) − PR(p; t)| < ε

where |x| produces the absolute value of x. Often, ε = 10^-7 is selected.

Algorithm 1 shows how the PageRank problem can be solved using the power iteration method. It uses two arrays to store PageRank values: pr stores the current estimates of the PageRank values, i.e., PR(v; t). newpr stores the new estimates which are being made in the current iteration of the power method, i.e., PR(v; t + 1). Each iteration of the power method evaluates Equation 3.

Algorithm 1: Solving the PageRank Problem using the Power Iteration Method.
    Data: G = (V, E), a graph; damping factor α; accepted error ε
    Result: pr: the PageRank values
     1  for v ∈ V do pr[v] = 1/n;
     2  converged = false;
     3  while not converged do
     4      for v ∈ V do newpr[v] = 0;
     5      for d ∈ V do
     6          for s ∈ in(d) do
     7              newpr[d] += α · pr[s] / deg+(s);
     8          end
     9      end
            // Normalize vector to sum to 1
    10      l = 0;
    11      for v ∈ V do l += newpr[v];
    12      for v ∈ V do newpr[v] = newpr[v] / l;
            // Check convergence
    13      l = 0;
    14      for v ∈ V do l += abs(pr[v] − newpr[v]);
    15      converged = (l < ε);
            // Swap to new solution
    16      for v ∈ V do pr[v] = newpr[v];
    17  end

First, the initial PageRank values are set up in line 1, in accordance with Equation 2. The while loop starting at line 3 loops over the iterations of the power iteration method. It checks if the computation has converged. Initially, it has not. The algorithm then proceeds to initialise the new PageRank estimates in newpr to zero. It then moves on to evaluate the main part of Equation 3. Lines 5 to 9 are written in strict correspondence with the right-hand term in Equation 3 (the constants (1 − α)/n are not explicitly added up). That means: for each vertex d, we sum up the PageRank contributions of all its incoming edges. The set of the source vertices of the incoming edges is in(d). For each vertex s in this set, we look up the current PageRank estimate in pr, divide by the out-degree of s and multiply with the damping factor α.

So far, we have performed the main part of evaluating Equation 3, but we have not added in the constants (1 − α)/n. We could add a loop that adds this constant to each entry of newpr; however, that is actually unnecessary. A key problem in any algorithm working with real numbers is that the computer cannot always perform the arithmetic exactly (see the appendix on real number representation). As such, we need to compensate for errors creeping into the computation. The observable error is that the PageRank values do not add up to 1. We can compensate for this by multiplying the PageRank values such that the sum does add up to 1. Hereto, we calculate the sum of the PageRank values (l in line 11) and we then multiply the PageRank values by 1/l in line 12 such that they sum up to 1. As a side effect of this normalisation step, we don't need to add in the constants (1 − α)/n.

The power iteration then finishes off by checking convergence (line 15) and copying the newly computed estimates of the PageRank values from newpr to pr (line 16).

The key part of Algorithm 1 is lines 5–9 and this is where we will focus. There are many ways in which these lines can be encoded. Algorithm 2 shows how, instead of using two loops over destinations d and sources s, an equivalent code is to use one loop over all edges, and to extract source and destination from the edge.

Algorithm 2: Alternative formulation for lines 5–9 of Algorithm 1
     1  for e ∈ E do
     2      s = source(e);
     3      d = destination(e);
     4      newpr[d] += α · pr[s] / deg+(s);
     5  end
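To connect the pseudocode to Java, the sketch below is one possible implementation of Algorithm 1 that uses the edge-centric loop of Algorithm 2. It assumes the graph is given as two parallel edge arrays src and dst together with a precomputed array of out-degrees; these names and the flat-array representation are illustrative assumptions rather than the required design (the next section discusses graph representations).

    import java.util.Arrays;

    // Minimal sketch of Algorithm 1 using the edge loop of Algorithm 2.
    // The graph is assumed to be two parallel edge arrays plus precomputed
    // out-degrees; all names are illustrative.
    public class PageRankSketch {
        public static double[] pageRank(int n, int[] src, int[] dst,
                                        int[] outDegree, double alpha, double eps) {
            double[] pr = new double[n];
            double[] newpr = new double[n];
            Arrays.fill(pr, 1.0 / n);                       // line 1, Equation 2

            boolean converged = false;                      // line 2
            while (!converged) {                            // line 3
                Arrays.fill(newpr, 0.0);                    // line 4
                for (int e = 0; e < src.length; ++e)        // Algorithm 2
                    newpr[dst[e]] += alpha * pr[src[e]] / outDegree[src[e]];

                double l = 0.0;                             // normalize (lines 10-12)
                for (int v = 0; v < n; ++v) l += newpr[v];
                for (int v = 0; v < n; ++v) newpr[v] /= l;

                double delta = 0.0;                         // convergence (lines 13-15)
                for (int v = 0; v < n; ++v) delta += Math.abs(pr[v] - newpr[v]);
                converged = delta < eps;

                for (int v = 0; v < n; ++v) pr[v] = newpr[v];  // line 16
            }
            return pr;
        }
    }

Note that, like Algorithm 1, this sketch relies on the normalisation step (lines 10–12), so the term (1 − α)/n never appears explicitly.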
4 Representing Graphs as Sparse Matrices

The discussion so far has considered a description of the algorithm using pseudo-code. To make the algorithm concrete, we need to choose what data structures we will use to represent the graph. When choosing a data structure, we need to consider what operations are performed on the graph. We can discern the following:

- It must be possible to iterate over all vertices, e.g., Algorithm 1, line 1.
- It must be possible to iterate over all edges, e.g., Algorithm 1, line 5 or Algorithm 2.
- It must be possible to retrieve the out-degree of a vertex.

The first constraint can be resolved easily if we pre-process the graph and conveniently represent the vertices as integers in a dense range, e.g., we assign to each vertex an integer in the range 0, ..., n − 1 for a graph with n vertices.

The second constraint is more critical and requires thought. As we will be dealing with very large graphs, it is best to utilise representations that use a few very large arrays. We will focus on fairly easy representations; much more complex representations are described in the literature [2].

Depending on the data structure we choose for the graph, it may be possible to retrieve the out-degree of a vertex directly from the graph data structure, which would satisfy the third constraint. Alternatively, we can calculate the out-degree of each vertex once at the start of the program and store these values in an array. Then we can look them up efficiently.

4.1 Adjacency Matrix

The adjacency matrix of a graph G = (V, E) is a square binary matrix with dimension |V|. Assume that the number of vertices is represented by n (n = |V|) and the number of edges by m (m = |E|). Assume further that vertices are labelled with integer numbers in the range 0, ..., n − 1, i.e., V = {0, 1, 2, ..., n − 1}. The element a_ij on row i and column j in the adjacency matrix is determined as follows:

1. a_ij = 1 if an edge exists from vertex j to vertex i;
2. a_ij = 0 if the graph does not contain an edge from vertex j to vertex i.

(Strictly speaking, this definition gives the transpose of the adjacency matrix, which is obtained by swapping the position of the indices i and j: the element at row i and column j in the transpose of a matrix A equals the element at row j and column i in A. It is customary in graph analytics to operate on the transpose of the adjacency matrix.)

Note that the adjacency matrix is typically a sparse matrix: most of the entries are 0. Sparse matrices may be represented as 2-dimensional arrays; however, this would take a lot of space, namely n^2 elements. Because of the large number of zeroes in the matrix, there exist more efficient alternatives that typically take space proportional to the number of non-zeroes, i.e., proportional to the number of edges, which is typically much smaller than n^2.

4.2 Coordinate Format, a.k.a. Edge Lists

A straightforward representation of a sparse matrix is the Coordinate Format, abbreviated to COO: simply list all edges as pairs of source and destination. As such, an array is required of length m which holds the pairs (i, j) ∈ E.

Alternatively, one may also create two arrays of length m where one array stores the source vertices and the other stores the destinations of an edge. Sources and destinations are stored in corresponding locations: if the source of an edge is stored at index k in the source array, then its destination is stored also at index k in the destination array. (Terminology: the distinction between these two data layouts is known as the distinction between an "array of structures" layout and a "structure of arrays" layout.)

Note that the COO format does not specify in which order the edges are stored. That makes it inefficient to answer certain queries, such as retrieving all the edges that are incident to a particular vertex. This is, however, irrelevant to the PageRank problem (as Algorithm 2 may be used).
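As an illustration of the "structure of arrays" COO layout just described, the following minimal Java sketch stores the two parallel arrays and computes the out-degrees once up front, as suggested at the start of this section. The class and field names are illustrative only, not a prescribed interface.

    // Minimal sketch of a COO ("structure of arrays") edge list.
    // Names are illustrative and not part of the assignment's required design.
    public class COOGraph {
        final int numVertices;
        final int[] source;       // source[k] is the source of edge k
        final int[] destination;  // destination[k] is the destination of edge k

        COOGraph(int numVertices, int[] source, int[] destination) {
            this.numVertices = numVertices;
            this.source = source;
            this.destination = destination;
        }

        int numEdges() {
            return source.length;
        }

        // Out-degrees calculated once at the start of the program (Section 4).
        int[] outDegrees() {
            int[] deg = new int[numVertices];
            for (int k = 0; k < source.length; ++k)
                deg[source[k]]++;
            return deg;
        }
    }

With this layout, the loop of Algorithm 2 simply runs k from 0 to numEdges() − 1 and reads source[k] and destination[k].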
4.3 Compressed Sparse Rows

The Compressed Sparse Rows format, abbreviated to CSR, provides an indexed representation of the graph. In particular, it individually compresses each row of the adjacency matrix and stores only the destination vertices, i.e., it stores the values j where a_ij = 1. This is a compressed representation of the dense matrix, which would store all values a_ij for j = 0, ..., n − 1 on every row.

The CSR representation requires two arrays: a destination array containing the destination vertex IDs for each edge, and an index array containing for every source vertex the index of its first destination ID in the destination array. The destination array has length m. The index array has length n + 1, where an extra element is inserted to simplify the traversal of the data structure.

Figure 1 shows the CSR format for an exemplar graph.

Figure 1: A small graph with 6 vertices and 14 edges, and its representation in the CSR and CSC formats.
    CSR format: index = [0, 5, 5, 6, 8, 9, 14]; destination = [1, 2, 3, 4, 5, 4, 4, 5, 5, 0, 1, 2, 3, 4]
    CSC format: index = [0, 1, 3, 5, 7, 11, 14]; source = [5, 0, 5, 0, 5, 0, 5, 0, 2, 3, 5, 0, 3, 4]

The CSR representation allows one to quickly recover all outgoing edges of a vertex with index k: (i) look up the numbers index[k] and index[k+1] in the index array; (ii) the destinations of the outgoing edges for vertex k are found in the destination array at positions j = index[k] up to but not including j = index[k+1].

4.4 Compressed Sparse Columns

The Compressed Sparse Columns format, abbreviated to CSC, is an indexed representation of the graph, similar to CSR. The main difference is that it lists the incoming edges as opposed to the outgoing edges. As such, it uses an index array of length n + 1 and a source array of length m. The process for traversing the graph is analogous to the CSR. Figure 1 graphically illustrates the CSC representation.
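To show how the CSC arrays are traversed, the following minimal Java sketch accumulates the PageRank contributions of the in-edges of one vertex d, in the style of lines 6–8 of Algorithm 1. The parameter names (index, source, outDegree) are illustrative; outDegree is assumed to be precomputed.

    // Minimal sketch: accumulate the contributions of the in-edges of vertex d
    // from a CSC representation. All names are illustrative.
    public class CSCTraversal {
        public static double contributionsOf(int d, double alpha, double[] pr,
                                             int[] index, int[] source,
                                             int[] outDegree) {
            double sum = 0.0;
            // The in-edges of d occupy positions index[d] .. index[d+1]-1 of source.
            for (int k = index[d]; k < index[d + 1]; ++k) {
                int s = source[k];                    // source of the k-th in-edge of d
                sum += alpha * pr[s] / outDegree[s];  // line 7 of Algorithm 1
            }
            return sum;
        }
    }

For the CSC arrays of Figure 1, for example, the in-edges of vertex 4 are found at positions index[4] = 7 up to but not including index[5] = 11 of the source array.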
5 Reference Solution

To help you with debugging your code, Table 1 lists the PageRank values as they are calculated by the reference solution to this exercise. The PageRank values seem to converge from iteration 18 onwards. Note, however, that we show only 6 significant digits while we aim to calculate 7 converged digits (ε = 10^-7).

Table 1: PageRank values for the 6-node graph in Figure 1

    t     delta        PR(0;t)    PR(1;t)   PR(2;t)   PR(3;t)   PR(4;t)   PR(5;t)
    init               0.166666   0.166666  0.166666  0.166666  0.166666  0.166666
    1     0.547778     0.0769444  0.105278  0.105278  0.105278  0.317778  0.289444
    2     0.181160     0.0891199  0.102200  0.102200  0.102200  0.236429  0.367849
    3     0.137640     0.102013   0.117163  0.117163  0.117163  0.247469  0.299029
    4     0.0634867    0.0924331  0.109775  0.109775  0.109775  0.259158  0.319083
    5     0.0173711    0.0947956  0.110509  0.110509  0.110509  0.250473  0.323204
    6     0.0131304    0.0956002  0.111715  0.111715  0.111715  0.252615  0.316639
    7     0.00674097   0.0946550  0.110907  0.110907  0.110907  0.253344  0.319280
    8     0.00171397   0.0949894  0.111081  0.111081  0.111081  0.252487  0.319281
    9     0.00114623   0.0950142  0.111162  0.111162  0.111162  0.252790  0.318708
    10    6.61535E-4   0.0949284  0.111081  0.111081  0.111081  0.252813  0.319016
    11    2.39131E-4   0.0949692  0.111107  0.111107  0.111107  0.252735  0.318975
    12    9.54587E-5   0.0949658  0.111111  0.111111  0.111111  0.252772  0.318930
    13    6.58410E-5   0.0949588  0.111103  0.111103  0.111103  0.252769  0.318963
    14    2.89733E-5   0.0949633  0.111106  0.111106  0.111106  0.252762  0.318955
    15    8.19374E-6   0.0949624  0.111106  0.111106  0.111106  0.252767  0.318952
    16    6.49790E-6   0.0949619  0.111106  0.111106  0.111106  0.252766  0.318956
    17    3.18692E-6   0.0949624  0.111106  0.111106  0.111106  0.252765  0.318954
    18    8.35832E-7   0.0949622  0.111106  0.111106  0.111106  0.252766  0.318954
    19    5.90395E-7   0.0949622  0.111106  0.111106  0.111106  0.252766  0.318955
    20    3.23357E-7   0.0949623  0.111106  0.111106  0.111106  0.252766  0.318954
    21    1.01826E-7   0.0949623  0.111106  0.111106  0.111106  0.252766  0.318955
    22    4.92322E-8   0.0949623  0.111106  0.111106  0.111106  0.252766  0.318954

References

[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number: SIDL-WP-1999-0120. Available online: http://ilpubs.stanford.edu:8090/422/.

[2] Y. Saad. SPARSKIT: A basic tool for sparse matrix computations. Technical Report NASA-CR-185876, NASA, May 1990. Available online: https://ntrs.nasa.gov/search.jsp?R=19910023551.

A Real Number Representation

Real numbers are typically represented approximately in a computer using floating-point number representations. These number representations have a finite number of bits, while real numbers may have an infinite number of digits. For instance, the number 1/3 has an infinitely long representation 0.33333···. As such, numbers are represented with limited accuracy.

The IEEE-754 standard for floating-point number representation specifies that a float value consists of 32 bits. These are 1 sign bit to indicate if the number is positive or negative, 8 exponent bits and 23 mantissa bits. Similarly, a double consists of 64 bits, with 1 sign bit, 11 exponent bits and 52 mantissa bits. Let s be the sign bit, e the exponent bits and m the mantissa bits of a floating-point number; then the real number that is represented is (−1)^s × (1.m) × 2^(e−b), where b = 127 for float and b = 1023 for double. The sign of the number is retrieved as (−1)^0 = 1 and (−1)^1 = −1. The constant b ensures that both very large and very small numbers can be represented. It is furthermore assumed that the exponent is chosen such that it indicates the leading (most significant) 1-bit of the real number. As every real number that is not zero has such a bit, we don't need to store it; effectively, a float therefore carries 24 significant bits and a double carries 53. The zero value is represented by the special case of s = 0, e = 0 and m = 0.

Any real number can be represented with a relative precision of about ε = 2^-M, where M is the number of mantissa bits. The difference between any real number r and its floating-point representation r̂ is thus bounded: |r − r̂| / |r| ≤ ε. That is good news: a float can represent a real number with a relative error of at most 2^-23 ≈ 1.19 × 10^-7 and a double introduces a relative error of at most 2^-53 ≈ 1.11 × 10^-16.

Because numbers are not always represented exactly, every arithmetic operation using floating-point number representations may introduce an error. The problem results from the representation of the numbers. Assume we have two real numbers a and b that can be represented exactly by floating-point values â = a and b̂ = b. If we desire to calculate a + b, then the computer will evaluate â + b̂, i.e., it will perform the addition on the floating-point representations, calculate an exact result for â + b̂ and then finally convert this exact result to a floating-point number representation. The final conversion from the exact result to a floating-point number is significant: the addition may generate more non-zero mantissa bits, e.g., because of carries taking place. However, only a limited number of mantissa bits can be stored. Some mantissa bits need to be dropped. This causes an inaccuracy of floating-point addition, even if the added values can be represented exactly. The error can be bounded by the same ε value as the representation of numbers.

A similar observation applies to floating-point subtraction, multiplication and division. In each case, the error bound is ε.

If a long chain of computations is performed, then the error will propagate and all individual errors may add up. As such, when n PageRank values are added up in line 11 of Algorithm 1, the error on the value of l may be as large as n × 2^-53. Assuming a double representation and a graph with a billion edges, which is less than the current size of the World Wide Web, the total error may be as large as 10^9 × 2^-53 ≈ 10^9 × 1.11 × 10^-16 ≈ 1.11 × 10^-7 > 10^-7. In other words, the error due to floating-point arithmetic may be larger than the desired accuracy of the computation (typically 10^-7), which means the algorithm would never converge.
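The effect described above is easy to observe. The following minimal Java sketch adds the value 1/10 ten times, mimicking the summation in line 11 of Algorithm 1 for a 10-vertex graph with uniform PageRank values. Because 0.1 has no exact binary floating-point representation, the computed sums differ slightly from 1. The sketch is an illustration of rounding error only and is not part of the assignment.

    // Minimal sketch: rounding error when summing floating-point values.
    public class RoundingDemo {
        public static void main(String[] args) {
            double d = 0.0;
            float f = 0.0f;
            for (int i = 0; i < 10; ++i) {
                d += 0.1;    // 0.1 is not exactly representable in binary
                f += 0.1f;
            }
            System.out.println(d == 1.0);   // false
            System.out.println(d);          // 0.9999999999999999
            System.out.println(f == 1.0f);  // false
            System.out.println(f);          // the float error is larger, around 1e-7
        }
    }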