
CS 130A
Assignment II - The Hottest of Them All

Assigned: October 18th, 2018
Due On Demo Day: November 2nd, 2018

PLEASE NOTE:
Solutions have to be your own. No collaboration or cooperation among students is permitted.
5% of the points will be deducted for each day the assignment is late, with a maximum of 4 days.
Requests for a regrade must be submitted within seven days from the day when we return the assignment.

1 Introduction

The heavy hitter or top-k problem is a famous problem in computer science. It typically arises in settings where a stream of data (sensor data, click stream, advertisement requests, etc.) is analyzed to detect the most popular items. An easy solution is to keep a counter for each item in the stream and increment the counter every time the corresponding item is received. Unfortunately, the domain of items is often too large to maintain a counter for each element; e.g., the domain might be all IP addresses. In this case, we typically maintain a smaller, finite set of counters. For each new item, we create a counter that corresponds to that item, and the counter is incremented every time this item is encountered. This works fine until the set of counters is full. At that point, we need to eliminate one counter in order to add a new one. Usually, the element with the smallest count is deleted. Different strategies have been proposed regarding the initialization of the new counter. One very successful strategy is to initialize the new counter with the value of the deleted (smallest count) counter. Of course, the resulting top-k elements are an approximation of the real top-k, but it has been shown to be quite accurate for realistic distributions. In this assignment, we will use this strategy.

2 Basic Data Structures

In this assignment, you will implement a simple version of the heavy hitter problem, where we would like you to find the most popular words in a document. You will be given a .txt file, which is an article containing words.
You will read the words one by one from the file and at the end identify an approximation of the 15 most popular, i.e., most frequent, words. Given that we often need to retrieve (and delete) the word with the smallest frequency, a min heap is a natural choice. However, finding an item in a min heap is not easy. Therefore, you need an additional data structure to easily retrieve each element in the heap. For this, use a hash table.

A binary min heap is a complete binary tree which satisfies the following min heap ordering invariant.

The min heap invariant: the value of each node is greater than or equal to the value of its parent, with the minimum-value element at the root.

A min heap can be uniquely represented by storing its level order traversal in an array, as shown in Figure 1. Consider the kth element of the array: its left child is located at index 2*k, its right child is located at index 2*k+1, and its parent is located at index k/2 (integer division, i.e., rounded down).

Figure 1: min heap

In this assignment, you will implement a min heap using an array object where each item in the heap contains the frequency of a word.

In order to keep track of the words corresponding to the frequencies in the min heap, you will implement a hash table. Each entry in the hash table contains the word hashed to that location as well as a pointer to the corresponding entry in the min heap (since the min heap is implemented using an array, this is simply an index in the array). We leave it to you to decide whether to use a chaining or a probing hash table, as well as the hash function and the details of collision handling. However, since the minimum word in the min heap will be deleted and replaced, you need to support deletion in the hash table (as well as insertion).

Every time a word is read, it is looked up in the hash table. There are three cases to consider:

1. If the string already exists, increment its frequency by one in the min heap and percolate it down to the correct place in the min heap, while updating all the corresponding pointers in the hash table.
2. If the word does not exist in the hash table and the min heap is not full, the new word is inserted into the min heap and is initialized to a frequency of one. Furthermore, the word is inserted into the hash table with a pointer to the corresponding frequency in the min heap (the root in this case).
3. If the word does not exist in the hash table and the min heap is full, retrieve the existing word with the minimum frequency, delete it from the hash table, and replace it with the new word while keeping the frequency as before. The new word is also inserted in the hash table with a pointer to the corresponding frequency in the min heap.

NOTE
Your algorithm should be case insensitive, meaning that, for example, "he" and "He" should be treated as the same string.
We only care about strings that are words in the file, meaning that blank spaces, commas, and so on should not be considered strings.
The algorithm will give an approximation of the 15 most frequent strings.
The size of the heap (array) should always be 16 because we keep index 0 empty.
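The three cases above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the required implementation: the names `TopK`, `pos`, `insert`, and `replace_min` are hypothetical, and a built-in `dict` stands in for the hash table that the assignment asks you to write yourself (including your own hash function, collision handling, and deletion).

```python
class TopK:
    """Fixed-capacity min heap of [frequency, word] entries plus a word -> index map."""

    def __init__(self, capacity=15):
        self.capacity = capacity
        self.heap = [None]  # index 0 stays empty; a full heap uses capacity + 1 slots
        self.pos = {}       # ASSUMPTION: dict stand-in for the hand-written hash table

    def _swap(self, i, j):
        # Swap two heap entries and keep the word -> index map in sync.
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _percolate_down(self, k):
        n = len(self.heap) - 1
        while 2 * k <= n:
            child = 2 * k  # left child; switch to the right child if it is smaller
            if child + 1 <= n and self.heap[child + 1][0] < self.heap[child][0]:
                child += 1
            if self.heap[k][0] <= self.heap[child][0]:
                break
            self._swap(k, child)
            k = child

    def _percolate_up(self, k):
        # The parent of index k is k // 2 (integer division).
        while k > 1 and self.heap[k][0] < self.heap[k // 2][0]:
            self._swap(k, k // 2)
            k //= 2

    def insert(self, word):
        word = word.lower()  # case insensitive: "He" and "he" are the same string
        if word in self.pos:
            # Case 1: known word. Increment its frequency and percolate down.
            k = self.pos[word]
            self.heap[k][0] += 1
            self._percolate_down(k)
        elif len(self.heap) - 1 < self.capacity:
            # Case 2: new word, heap not full. Start at frequency one.
            self.heap.append([1, word])
            self.pos[word] = len(self.heap) - 1
            self._percolate_up(len(self.heap) - 1)
        else:
            # Case 3: new word, heap full.
            self.replace_min(word)

    def replace_min(self, word):
        # The root (index 1) holds the minimum frequency. Swap in the new word
        # and keep the old frequency, as the assignment specifies.
        old_word = self.heap[1][1]
        del self.pos[old_word]
        self.heap[1][1] = word
        self.pos[word] = 1
```

With the default `capacity=15`, the backing array has 16 slots once the heap is full, matching the note above. Because the frequency is unchanged in case 3, the heap order is unaffected and no percolation is needed after the replacement.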
Figure 2 shows a high level sketch of the two data structures used in this assignment and how the hash table needs to have pointers to the corresponding entries in the heap.

3 Implementation details

As a part of this homework, you will implement 4 functions as explained below:

Insert: This function will take a string as an input.
– If the newly inserted string already exists in the hash table, first locate the position of the string in the min heap using the hash table, then update the frequency of the string in the min heap, percolate the element to its correct place in the array, and lastly update the hash table to point to the updated position.
– If the newly inserted string does not exist in the hash table, check whether the min heap is full. If it is full, simply replace the root entry of the heap with the newly inserted word, keep the frequency, and then update the hash table (delete the old word and insert the new word in the correct place using the hash function). This is essentially achieved by implementing the ReplaceMin function as explained below. If the min heap is not full, insert the string into the heap and then update the hash table.

ReplaceMin: This function will be called when the newly inserted string does not exist in the hash table and the min heap is full. Replace the first element of the array (index one of the array), which has the lowest frequency, with the newly inserted string and update the hash table (i.e., you have to locate the old string in the hash table, delete its entry, and use the hash function to place the newly inserted word in the correct place).

PrintHeap: This function will print out the 15 most frequent words along with their corresponding frequencies.

PrintHashTable: This function will print out the current hash table.

3.1 Program Flow

NOTE: We do not want any front end UI for this project. Your project will be run on the terminal and the input/output for the demo will use stdio.
The file name will be provided as an input to your program. After running your program, we will ask you to call the PrintHeap function, which will print out the 15 most frequent strings along with their corresponding frequencies, and the PrintHashTable function, which will print out the whole hash table. Then we will interact with your program (i.e., we will let you call insert(tree)), after which you will be asked to call PrintHeap and PrintHashTable again. This process might be repeated multiple times during the demo.

3.2 Extra Credit and Sanity Check

Given that there are about 250,000 words in English, it is not so unreasonable to maintain an array of size 250,000, where each entry maintains the frequency of the corresponding word in the file you are analyzing. As extra credit, you can try to figure out a way to maintain the frequency of all words in the given file and then retrieve the 15 most frequent. Compare them to what your approximate fixed size min heap solution gives.

4 Demo

We will have a short demo for each project. It will be on November 2nd, 2018 in CSIL. Time details will be announced later. Please be ready with the working program at the time of your demo.

Figure 2: example
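For the sanity check described in Section 3.2, one possible way to obtain exact counts is sketched below in Python. It is a minimal sketch, not the method the assignment prescribes: it uses the standard library's `collections.Counter` instead of a 250,000-entry array, the function name `exact_top_words` is hypothetical, and it assumes that a word is a maximal run of ASCII letters (so an apostrophe or hyphen splits a word), which the handout does not specify.

```python
import re
from collections import Counter


def exact_top_words(text, k=15):
    """Return the k most frequent words in text as (word, count) pairs."""
    # Lower-case first so that "He" and "he" count as the same word.
    # ASSUMPTION: a word is a maximal run of the letters a-z.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(k)
```

To apply it to the input file, read the file contents into a string and pass that string to `exact_top_words`, then compare the result with the output of PrintHeap.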
