Google Interview Question for Software Engineers / Developers


  • google-interview-questions
    8
    of  8 votes
48 Answers

Given a large network of computers, each keeping log files of visited URLs, find the ten most visited URLs.
(i.e. there are many large <string (url) -> int (visits)> maps; implicitly compute the combined <string (url) -> int (sum of visits among all distributed maps)> map, and get the top ten in that combined map)

The result list must be exact, and the maps are too large to transmit over the network (in particular, sending all of them to a central server, or using MapReduce directly, is not allowed).

- chandeepsingh85 on September 26, 2013 in United States
    Google Software Engineer / Developer System Design

Country: United States
Interview Type: In-Person


9
of 13 votes

Presuming a protocol exists that can ask three questions of each server:

* the score of a single URL
* its top ten
* all URLs whose score is >= a given threshold

We program a two-pass solution like so:

We denote the number of servers as S. 

[First pass] 
(1) Ask every server for its own top ten 

(2) Merge the results. For all URLs in the merged set, calculate correct totals by asking
all servers for their scores for each URL. Calculate the top ten of this sample.

(3) Pick the score of the now-tenth URL as the threshold that we try to beat
in the second round. We denote the threshold as T.

[Second pass] 
(4) Ask every server for all URLs that satisfy score >= T/S

(5) Merge these bigger samples again as in step (2) 

(6) We now have the correct top ten with correct scores.
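
A minimal sketch of the two passes in Python, assuming hypothetical per-server calls top_k, scores and urls_with_score_at_least (illustrative names, not a real API):

from collections import Counter

def exact_top_ten(servers):
    # First pass (steps 1-3): union of every server's local top ten, then exact totals.
    candidates = set()
    for s in servers:
        candidates.update(url for url, score in s.top_k(10))
    totals = Counter()
    for s in servers:
        totals.update(s.scores(candidates))        # {url: local visits} for the candidates
    threshold = sorted(totals.values(), reverse=True)[:10][-1]   # score of the tenth URL, T

    # Second pass (steps 4-6): any URL with total score >= T must score >= T/S on some server.
    for s in servers:
        candidates.update(s.urls_with_score_at_least(threshold / len(servers)))
    totals = Counter()
    for s in servers:
        totals.update(s.scores(candidates))
    return totals.most_common(10)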

- laperla on September 27, 2013 | FlagReply
-1
of  1 vote

I think they would expect a solution that spreads the work more uniformly among the different servers, which is hard with these constraints.

However, I think your approach has the merit of getting the right result. +1

- Miguel Oliveira on September 28, 2013 | Flag
0
of  0 votes

I'm not sure this is going to work all the time. Please correct me if I'm wrong.

X:Y means URL X has a count of Y.
We want the top 2 URLs.

M/C A=> 1:100 2:96 7:98 
M/C B=> 3:99 5:97 7:2 
M/C C=> 4:98 6:95 7:2

1st Step
A=>1,2
B=>3,5
C=>4,6

2nd Step
Top two after merging
1,3 Urls are selected

3rd Step
Threshold=99(Selected from url 3)

4th Step
Score = 99/3=33
A=>1,2,7
B=>3,5
C=>4,6

5th Step
Merging will again give us 1,3 when in fact URL 7 has the highest count

- Mithun on September 30, 2013 | Flag
0
of  0 votes

Mithun, I think you're missing a step.

Step (2) says you merge the sets and *request* counts from all servers. This will give you a score of 102 for URL 7.

- VSuba on September 30, 2013 | Flag
-1
of  1 vote

Mithun, you made a mistake in your example and I think you misunderstood the suggested approach.
In M/C A => 1:100 2:96 7:98, URLs 1 and 7 would be the top 2.

Anyway, even if we change it to
M/C A=> 1:100 2:96 7:94 
M/C B=> 3:99 5:97 7:4
M/C C=> 4:98 6:95 7:4

Where 7 is the most visited site as you intended, the merging in the 5th step implies that "for all URLs in the merged set [we] calculate correct values by asking all servers for their scores for each URL", so we would get 102 as the count for URL 7.
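
A quick numeric check of this modified example (counts copied from the maps above; the candidate set is everything that clears T/S = 99/3 = 33 on at least one machine, plus the first-pass candidates):

from collections import Counter

A = Counter({"1": 100, "2": 96, "7": 94})
B = Counter({"3": 99, "5": 97, "7": 4})
C = Counter({"4": 98, "6": 95, "7": 4})

candidates = {"1", "2", "3", "4", "5", "6", "7"}

totals = Counter()
for machine in (A, B, C):
    totals.update({u: machine[u] for u in candidates if machine[u]})

print(totals.most_common(2))   # [('7', 102), ('1', 100)] -> URL 7 is correctly ranked first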

- Miguel Oliveira on September 30, 2013 | Flag
0
of  0 votes

Oh OK, got it. So in the worst case, if all URLs satisfy score >= T/S, all of them will be sent to the central server for merging, right?

- Mithun on October 01, 2013 | Flag
0
of  0 votes

The approach is correct; the only issue is that it will not scale, and for a large number of machines the network traffic may be gargantuan. Consider an example where the top-ten counts are around 1 million and the number of servers is in the 10k range (not a big number at Amazon or Google scale). You will eventually ask for every URL with a count >= T/S, which is 100 in this case, so you will end up sending a lot more data than is actually needed (every URL with a count between 100 and 1 million). The central node running this algorithm is also a bottleneck that won't scale. As I said, the solution is correct, but not scalable.

- vik on October 01, 2013 | Flag
2
of  2 votes

Thanks, vik, for your thoughts on scalability. I think this shows how open-ended the question actually is. Without more knowledge about the topology of the machines, datacenters, load balancers, etc. involved, it is not possible to prove the quality and scalability of the algorithm in real life. A few things I would suggest discussing:

- Is it a typical weblog file? Typical weblogs have a steep decline at the start and a long flat tail, so the number of accesses at the tenth rank is usually high. If it is such a distribution, this algorithm is not bad.

- How biased are the load balancers? If the load balancers have a high degree of randomness built in, then the differences between the machines are statistically already small enough that the first pass is quite good and the second pass is not much work.

- Can clusters of machines be pre-merged? If racks or rows of racks can have a log server that collects the full logs for the rack or row, then the number of servers to contact can be drastically reduced.

- How often are these calculations repeated? Which similar questions need to be answered? How often? How exact must the numbers be? How up to date? Each answer would influence the program and tuning we want to set up to produce fast and good numbers.

- laperla on October 02, 2013 | Flag
0
of  0 votes

Consider this example. Let's find the top 1 URL for now; we can extrapolate the question to the top 10 URLs too.

3 servers
Server 1: URL A -> 2, URL G -> 1
Server 2: URL B -> 2, URL G -> 1
Server 3: URL C -> 2, URL G -> 1

Wouldn't the above algorithm give URL A, B or C as the top visited URL, whereas the actual top URL should be G?

- Izaaz on October 07, 2013 | Flag
0
of  0 votes

Hi Izaaz, in your example the first pass finds the critical threshold T to be 2. The second pass would then divide T=2 by S=3 and ask all servers for all URLs that have a score >= 2/3. In other words, it would merge the complete set of URLs and thus get URL G and the sum of the accesses to it in the second pass.

- laperla on October 07, 2013 | Flag
0
of  0 votes

Sorry if I'm missing something, but I don't think selecting the top N, with or without a threshold, is going to yield the correct result. Consider the following case:

Server A
Y1 - 11
Y2 - 11
Y3 - 11
Y4 - 10
Y5 - 10
Y6 - 9
Y7 - 9
Y8 - 9
Y9 - 9
Y10 - 9

G1 - 4

Server B
M1 - 12
M2 - 12
M3 - 11
M4 - 11
M5 - 10
M6 - 10
M7 - 10
M8 - 10
M9 - 10
M10 - 10

G1 - 9


The threshold is 10 / 2 = 5.

In the second pass, G1 on server A won't be included in the tally. In fact, G1, with 13 total visits, could be the top 1, but it does not even get into the top 10 with this method. Am I missing something?

- aka777 on October 12, 2013 | Flag
0
of  0 votes

@aka777, G1 is not part of the top of server A, but it will be part of the top of server B with visits >= 10 / 2. So the algorithm will ask for all G1 occurrences on the other servers and will correctly put it at top 1.

It will yield the correct result, possibly at the cost of multiple rounds.

- Miguel Oliveira on October 12, 2013 | Flag
0
of  0 votes

Thanks, Miguel. I got it.
Theoretically, the algorithm will yield the correct result. But Google has more than a million servers, and I don't know how this is going to work out. (The threshold is going to be very low, like vik said.)

- aka777 on October 12, 2013 | Flag
7
of 9 votes

The constraints puzzle me a bit, especially the "using MapReduce directly, is not allowed" one. I would try to discuss what that means exactly in an interview. 
I'll give another shot at the question: 

Denote N as the number of computers in our network. 

1) Pick a good string hash function. This function must take urls as input and produce hash values uniformly in the range [0, MAX_HASH_VALUE] 

2) Divide the hash range [0, MAX_HASH_VALUE] into N intervals with equal length MAX_HASH_VALUE / N, and assign each interval to 1 computer in the network, e.g. 
CPU_1) [0, length-1] 
CPU_2) [length, 2*length-1] 
CPU_3) [2*length, 3*length-1] 
... 
CPU_N) [(N-1)*length, MAX_HASH_VALUE] 

3) Each computer computes the hash values of its list of URLs and sends each URL and its visit count to the computer responsible for that hash.

4) Each computer receives a list of url -> visits entries for the URLs in its hash interval. Now it combines the visits per URL and produces its own top 10.

5) Each computer sends its top 10 to a central computer. This central computer will receive 10*N urls and compute the overall top 10. 

Due to our good hash function, we expect that each computer will receive roughly the same amount of urls to process in the 4th step. I think this approach is the best we can do to distribute the work among the cluster. 
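
A minimal sketch of steps 2-4 in Python; owner, shuffle and local_top10 are illustrative names, hash % N stands in for the explicit intervals, and the actual network transfer is left out:

import hashlib
from collections import Counter

N = 16  # number of computers in the network (example value)

def owner(url):
    # Computer responsible for this URL; a uniform hash modulo N behaves like equal intervals.
    return int(hashlib.sha1(url.encode()).hexdigest(), 16) % N

def shuffle(local_counts):
    # Step 3, run on every computer: bucket the local counts by owning computer.
    buckets = [Counter() for _ in range(N)]
    for url, visits in local_counts.items():
        buckets[owner(url)][url] += visits
    return buckets  # buckets[i] is what this computer would send to computer i

def local_top10(received_buckets):
    # Step 4, run on every computer: combine what was received and keep the top 10.
    combined = Counter()
    for bucket in received_buckets:
        combined.update(bucket)
    return combined.most_common(10)  # these 10*N entries go to the central computer (step 5)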

About the constraints, they're not very clear. 
a) This is sending all urls over the network but not to a single computer. In order to produce an exact result, I think all approaches end up sending all urls over the network in the worst case. Again, we would need to discuss with the interviewer if this is ok (the "especially sending all of them to a central server" part). 

b) This is similar to MapReduce. I think that by saying "using MapReduce directly is not allowed", the interviewer meant that we have to give a detailed explanation about how the work is distributed among the network of computers, instead of just saying "the MapReduce framework will distribute and combine the work for us".

- Miguel Oliveira on October 01, 2013 | FlagReply
0
of  0 votes

I think this is the only reasonable solution. The word "especially" in the phrase "especially to a central computer" seems to imply that sending the maps in a distributed manner might be acceptable, which is the only constraint this solution violates. It is the only solution that produces the correct result and is guaranteed (modulo the goodness of the hash function) to not send all the urls to the same server in the worst case.

- Anonymous on October 09, 2013 | Flag
0
of  0 votes

I think so as well. 
Someone -1'd all my replies to this and a few other threads, for some reason, and I guess this reply was kind of forgotten.

- Miguel Oliveira on October 09, 2013 | Flag
0
of  0 votes

As you said, you are sending all of the URLs over the network, which I guess is impossible as stated in the question. It works, but it doesn't conform to the constraints.

- Arian on October 15, 2013 | Flag
0
of  0 votes

This is the best solution I can come up with. It effectively distributes the computation and traffic over the network, and it does not need a central server.

- sztaoyong on November 08, 2013 | Flag
2
of 2 votes

I don't think any of the suggested solutions is right. It is possible to imagine a situation where some URL is ranked number 11 on every box, and so has very high visits overall, while every URL in the top 10 of each individual box is seen nowhere else, and so has low visits overall.

That said, I don't have any better ideas. This problem is hard!

- Anon on September 26, 2013 | FlagReply
0
of  0 votes

The first pass gives you a lower bound T. In the second pass, you will include that 11th URL in your candidate set: its overall visit count is certainly greater than that of the 10th URL found in the first pass, so at least one server must hold a share of at least T/S of its visits and will report it.

- Yong on November 08, 2013 | Flag
0
of 0 votes

I understand that with the given constraints it is not possible to get a trivial solution, but I think we have to consider the scenario where one URL in the actual top ten is visited on most of the machines, but only a few times on each, which keeps it out of the individual top-10 lists of every machine.

Does that sound right?

- rajesh on September 26, 2013 | FlagReply
0
of  0 votes

I agree: this example would fail.

Consider this example. Let's find the top 1 URL for now; we can extrapolate the question to the top 10 URLs too.

3 servers
Server 1: URL A -> 2, URL G -> 1
Server 2: URL B -> 2, URL G -> 1
Server 3: URL C -> 2, URL G -> 1

Wouldn't the above algorithm give URL A, B or C as the top visited URL, whereas the actual top URL should be G?

- izaazyunus on October 07, 2013 | Flag
0
of 0 votes

You could repeatedly request pages of the top URLs from each server until you have ten URLs and the smallest value in the top-10 list is at least the highest value returned by any server in the latest page. For example:

page = 1;
max = 10;
highest = Infinity;                          // highest count seen in the most recent page
while ((top.length < max || top.smallest < highest) && page < maxpages) {
    highest = -1;
    for (server in farm) {
        temp = getTop(server, max, page);    // page `page` of this server's top list
        merge(top, temp, max);               // keep at most `max` entries, one per unique url
        if (highest < temp.top) {
            highest = temp.top;              // largest count seen in this page across servers
        }
    }
    page++;
}

Merge would keep no more than max values based on unique url and highest visits.

- Reallistic on September 26, 2013 | FlagReply
0
of 0 votes

I would recursively group the servers into sets of two and aggregate the URLs (see the sketch after these steps).
Suppose there are 6 servers.
a) Group s1s2, s3s4, s5s6.
b) We know the entire map cannot be transmitted. Find a safe message size for the network, say n. Break the map on s1 into chunks of size n and send them over to s2. Do the same from s3 to s4 and from s5 to s6.
c) This is the tricky part. The question says we cannot do MapReduce directly. Does that mean MapReduce on the entire set? Is it allowed on individual machines? It would be silly to solve this without ever getting a count of the URLs, so if MapReduce is not allowed, write a procedure that sorts the URLs and tracks each one's count. This is done on s2, s4 and s6.
d) Now group the machines again, this time {s2, s4}, {s6}.
e) Repeat b and c. We will have s4 and s6 left. Transmit from s4 to s6 and perform the final count.
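
A rough sketch of this pairwise (tree) reduction; the chunked transfer in step b is hidden inside the merge, and the function names are illustrative only:

from collections import Counter

def merge_round(server_maps):
    # One round: merge the url->visits maps in pairs; in reality each map travels in n-sized chunks.
    survivors = []
    for i in range(0, len(server_maps), 2):
        merged = Counter(server_maps[i])
        if i + 1 < len(server_maps):
            merged.update(server_maps[i + 1])
        survivors.append(merged)
    return survivors

def tree_reduce_top10(server_maps):
    # Repeat rounds until one merged map remains, then read off its top ten.
    while len(server_maps) > 1:
        server_maps = merge_round(server_maps)
    return server_maps[0].most_common(10)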

- Aparna on September 27, 2013 | FlagReply
0
of 0 votes

Let's assume we can store the counts of k URLs on the central machine. If the corpus is of size n, we get the top n/k elements from each server and send them to our central machine.

So, on our central machine we have a set of k counters, each initialized to 0 to start with. As we get our data from the stream, for every URL that already exists we increment its counter. If not:
Case 1: if there is space on our central machine to add the URL, we add it and set its count.
Case 2: if not, we decrement the count of every URL we have on the central machine, and delete any
URL whose count drops to 0.

Now, the moment we hit Case 2 above, we record max_count_so_far and take a snapshot of the top 10 elements.

We process the next set of top n/k elements from the machines and, every max_count_so_far elements, we take a snapshot of the top 10 elements.

At some point, say after we have 10 such snapshots, we find the final top 10 elements from the snapshots we have so far.
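
The Case 1 / Case 2 logic above is essentially the classic frequent-elements (Misra-Gries) counter scheme. A minimal sketch, with the snapshot bookkeeping left out:

def frequent_candidates(stream, k):
    # `stream` yields one URL per visit record; `k` is how many counters fit on the central machine.
    counters = {}
    for url in stream:
        if url in counters:
            counters[url] += 1
        elif len(counters) < k:            # Case 1: still room for a new URL
            counters[url] = 1
        else:                              # Case 2: no room -> decrement every counter
            for u in list(counters):
                counters[u] -= 1
                if counters[u] == 0:
                    del counters[u]
    return counters                        # superset of every URL occurring more than n/(k+1) times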

- Anonymous on September 27, 2013 | FlagReply
0
of 0 votes

On each server, we sort the URLs by their frequencies.
Say the total number of log lines across all servers is N, the number of servers is s, and the capacity of a server is k. Now split the ranked URLs into groups of k/s elements on each server. Label each group; there will be N/(k/s) ids in total.

Now, from this set of ids, we randomly select s ids (i.e. s groups) such that no id occurs more than a threshold number of times on one machine. (To keep it simple, let's say the threshold is 1.)

Now we employ the following algorithm on the central machine:
If there is space on our central machine to add the new group, we add it and set the count of each of the elements in the group. If not, we decrement the count of every group/element we hold, and delete all URLs that have a count of 0.

One might argue that we might end up spending too much time on low-frequency entries. We could then employ multiple iterations of elimination: in the first pass, only consider frequencies above a certain threshold, say above the median frequency on each server, divide those URLs into groups, and consider a random group from each server.

- Anonymous on September 27, 2013 | FlagReply
0
of 0 votes

Let's say the number of servers is N. I think the solution requires 10 passes over the top-ten scorers of each server. In each pass you can only identify one of the top ten. After each pass, the URL selected in the previous pass must be excluded from evaluation in the subsequent passes, and the top-ten scores from each server must be updated.

Think of the first pass. When all the top-ten lists are considered, there must be at least one URL among these top tens that will land in the "real" top-ten list. It is possible to generate scenarios where 9 of the real top ten do not appear in the "current" top-ten lists, but there has to be at least one URL in the real top-ten list that also appears in the 10N URLs collected from the servers' current top-ten lists. Note that this is true only when N >= 10. If N < 10, the full real top-ten list may not appear in the 10N URLs.

Also note that after collecting the 10N URLs, for some of them you will have to ask the servers for their frequencies so that you can sort the 10N URLs properly, because you want to sort by the sum of each URL's visit frequencies across the servers.

After 10 such passes the real top-ten list will emerge. The messaging complexity of each pass could be as bad as N^2, since you may have to collect the frequency values for each one of the 10N URLs, but the computational complexity of each step is O(N) since you only need to find the max of the 10N URLs.

If we were to design an algorithm that finds not just the top 10 but the top K, then we would have to collect the top N scorers from each server. So, rather than 10N, we would bring together N^2 entries at each pass. If K > N, this design is preferable.

- yuksem on September 28, 2013 | FlagReply
-1
of  1 vote

Here's a counterexample where the top 3 does not appear in the top 3 of any of the 3 servers:

s1 = { a:10, b:9, c:8, d:7, e:6, f:5 },
s2 = { g:10, h:9, i:8, d:7, e:6, f:5 },
s3 = { j:10, k:9, l:8, d:7, e:6, f:5 }

The top 3 is { d:21, e:18, f:15 }, so that approach also does not guarantee 100% correctness.

- Miguel Oliveira on September 28, 2013 | Flag
0
of  0 votes

Good example.

In the first pass, it will find a (or g or j). In the second pass, d will be in the first three of s1, since a is kicked out of s1's top three. So the second pass will find d. The third pass will, in turn, find e. I can see, however, that f will not be included in the top-three list after three passes.

We can update the algorithm as follows:

After a pass, if the newly selected item lands in an unexpected spot (i.e., the item found in the i-th pass lands in a spot earlier than the i-th place in the top-items list), then we increase the total number of passes by one.

This should do it.

If not, I think making N^2 passes should do it.

- yuksem on September 28, 2013 | Flag
0
of  0 votes

Actually, even top N^2 may not do it.. :) Here is one more update to the algorithm I proposed:

Find the max value among all the lists: This should be easy to do. Findmax for all lists. Let's say that the max value is M.

Then, apply the iterative algorithm (with the extra check I described in my prev post) among the top lists of each server such that the top lists include all the values greater than or equal to M/N.

- yuksem on September 28, 2013 | Flag
0
of  0 votes

Well, top N^2 lists will do it. :) But cutting the top lists at the M/N threshold will perform better in the average case.

- [email protected] on September 28, 2013 | Flag
0
of  0vote

The logic is similar to the Tetris game algorithm, or a generalized voting algorithm.

- sfsh on October 02, 2013 | FlagReply
-1
of 1 vote

The easy way to solve this kind of problem is MapReduce. Since MapReduce is not allowed, the alternative is to sort the string (url) -> int (visits) map on each machine independently by number of visits. Then each server can send its top 10 visited URL -> visits entries to one of the servers.

This single server would receive the top-10 data from all the others and do a merge to decide on the overall top 10 sites (an n-way merge of the URL vs. visits lists).

- Vs on September 26, 2013 | FlagReply
1
of  3 votes

That's not correct. There could be a URL that is not in the top 10 for any one server, but is in the top 10 overall. See tony's response to Miguel Oliveira's answer.

- eugene.yarovoi on September 30, 2013 | Flag
-1
of 1 vote

Edit: this does not guarantee 100% correctness.

The question says we can't use MapReduce directly because the amount of data is too large. However, the overall top 10 sites are *expected* to be in the top 10 of at least one of the computers. 

Hence, we can sort the logs of each computer and emit the top 10 of each computer in the mapping phase. Then, the reduce phase aggregates the number of visits for each site. The result will be the top 10 of the aggregate.

- Miguel Oliveira on September 26, 2013 | FlagReply
1
of  1 vote

This is wrong. Imagine the most frequent terms are

a1={url1:100, url2:99, url3:98, url4:97 },
a2={url5:100, url6:99, url7:98, url4:97 }

Then the top three most frequent in the merged list should contain url4 as the top one, though url4 is not in the top three of either original list.

- tony on September 26, 2013 | Flag
-1
of  1 vote

Yeah, Anon gave an example in his answer. These types of questions require some discussion with the interviewer. I don't see a way to do it very efficiently given the constraints; maybe the constraints are not as strict as they seem.

- Miguel Oliveira on September 26, 2013 | Flag
-1
of 1 vote

What if we manage a heap of the top 10 sites at each node separately, and a heap of size 10 in the central unit? Keeping each heap as a min-heap, so heap[0] is the smallest count currently in the top 10, the central heap can be updated according to:

if (heap[0] < recentlyReceivedData)
{
   heap[0] = recentlyReceivedData;   // replace the smallest entry of the current top 10
   Heapify(heap, 0);                 // sift down to restore the min-heap property
}

Then at every pre-specified interval, we can request updates from the nodes, whose responses are the contents of their respective heaps.

- Anonymous on September 26, 2013 | FlagReply
-1
of  1 vote

The aggregated top 10 is not necessarily in the top 10 of an individual server, so it won't guarantee correctness.

- Miguel Oliveira on September 28, 2013 | Flag
-1
of 1 vote

Have a centralized counter. Every time a new hit is recorded, recalculate the centralized count for that URL with a properly synced data structure, then compare it with the top-10 stack, popping out the entries smaller than the current count. Keep updating the top-10 stack every time a new hit is recorded, and you can always query the top-10 stack for the top 10 hits.

- automan on September 27, 2013 | FlagReply
-1
of  1 vote

I think that violates "the maps are too large to transmit over the network (especially sending all of them to a central server)".

There are too many visits to centralize a counter on one machine.

- Miguel Oliveira on September 28, 2013 | Flag
-1
of 1 vote

We could use a BFS: visit every node and add its counts to a master map that has all the URLs, checking whether each URL is already present; if so, add the integer, otherwise create a new key and put it into the map. Then traverse the map to find the top ten values, or just sort in descending order and return the first 10 values.

- AVK on September 27, 2013 | FlagReply
-1
of  1 vote

"the maps are too large to transmit over the network (especially sending all of them to a central server (..)"

can't go that way

- Miguel Oliveira on September 28, 2013 | Flag
-1
of 1 vote

There are N machines.
Determine the top 10 on every machine.
Each machine transmits its top 10 to every other machine, i.e. (N-1)*10 URLs.
This implies each machine also receives (N-1)*10 URLs.

Process these and determine the top 10 on every machine.

Each machine now sends its top 10 to a single place (N*10 URLs).
The top 10 among these N*10 will be the most visited URLs.

I do not know if there is any way to avoid replicating the data on each machine. But will N*(N-1)*10 URLs be too much traffic for the network to handle? That is the total number of replications required.

The other possible solution I could think of was using a P-S (publish-subscribe) pattern to publish counts of URLs periodically.

- iGottaBeKidding on September 28, 2013 | FlagReply
-1
of  1 vote

That's quite similar to the approach I posted before. This does not guarantee 100% correctness: the aggregate top 10 does not need to be in the top 10 of any individual server.

- Miguel Oliveira on September 28, 2013 | Flag
0
of  0 votes

My bad, I had not read the other answers before replying.

- iGottaBeKidding on September 28, 2013 | Flag
-1
of 1 vote

1. Tag the nodes n1, n2, n3, ..., nk.
2. First, n1 sorts its list of URLs to find its top 10.
3. n1 sends this list to n2. n2 adds the list to its own data set and computes a top 10.
4. Now n2 sends its top 10 (calculated in step 3) to n3.
5. Keep doing this until we reach nk.

nk will have the cumulative top 10.

- jVivek on September 28, 2013 | FlagReply
-1
of  1 vote

It will not. Check Anon's and tony's posts above.

- Miguel Oliveira on September 28, 2013 | Flag
