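The software engineering assignment went through several rounds of changes, so in total I spent more than a day just on optimization, which is a bit embarrassing. I didn't read the assignment description carefully at first, and that sent me on a lot of detours. For example, if liz12 and Liz23 both appear, which one should be output in extended_mode? This one bugged me a lot: under the extended_mode rules it is ambiguous whether they count as one word or two different words, and if they count as one word, which spelling gets printed? Pretty frustrating. In the end our class agreed on a standard: output Liz23, because in alphabetical order L comes before l.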
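The second issue was running time. Everything I had written before was a small program handling small amounts of data, and the test data I designed myself was small too, so differences in running time never showed up. Then I found that on large inputs the gap is huge. This is the 13-minute program I mentioned in an earlier post: the test file was only 134 MB, yet it still took that long. I was in despair, with no idea how to make the algorithm more efficient, and several of my code changes actually made it slower. I was close to a breakdown... Eventually I realized that the only way I knew to look something up was to walk through an array, so I had stored everything in a C# List. Every time a new item came in, finding out whether it had already appeared meant scanning the List, and in the worst case you scan all the way to the end only to discover the item is not there. That is very expensive: with n distinct items, the worst case costs about n(n+1)/2 comparisons, a time complexity of O(n*n), which is a really poor algorithm. For large data we can use a hash table for lookups instead. A hash table lookup is O(1) on average, so for the same n distinct items the total drops to O(n). So I switched to C#'s Dictionary, and the running time dropped immediately: the same 134 MB of data now finishes in just 30 seconds. A world of difference...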
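To make the contrast concrete, here is a minimal sketch of the two counting strategies. It is written in C++ (which the assignment also allows) rather than the C# I actually used, with `std::vector` standing in for `List` and `std::unordered_map` for `Dictionary`. The function names `count_with_list` and `count_with_hash` are made up for illustration and are not taken from my program.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Slow approach: scan the whole list for every incoming word.
// Each call is O(n) in the number of distinct words seen so far,
// so n distinct words cost O(n*n) comparisons in total.
void count_with_list(std::vector<std::pair<std::string, int>>& counts,
                     const std::string& word) {
    for (auto& entry : counts) {
        if (entry.first == word) {
            ++entry.second;
            return;
        }
    }
    counts.emplace_back(word, 1);  // not found: append as a new word
}

// Fast approach: hash lookup, O(1) on average per word.
void count_with_hash(std::unordered_map<std::string, int>& counts,
                     const std::string& word) {
    ++counts[word];  // operator[] value-initializes a missing key to 0
}
```

Both functions produce the same counts; only the lookup cost differs, and that is where the gap shows up on a large input.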
Here is the assignment description, for reference:
Implement a console application to tally the frequency of words under a directory (2 modes).
For all text files under a directory (recursively) (file extensions: "txt", "cpp", "h", "cs"), calculate the frequency of each word, and output the result into a text file. Write the code in C++ or C#, using .NET Framework; the running environment is 32-bit Win7 or WinVista.
Run performance analysis tool on your code, find performance bottlenecks and improve.
Enable Code Quality Analysis for your code and get rid of all warnings.
Code Quality Analysis: http://msdn.microsoft.com/en-us/library/dd264897.aspx
Write 10 simple test cases to make sure your program can handle these cases correctly (e.g. a good test case could be: one of the sub-directories is empty).
Submission:
Definition:
“hao123” is a word, and “123hao” is NOT a word.
<word>: number
Where “number” is the number of times this word appears in the scan. The output should be sorted with the most frequent word first. If 2 words have the same frequency, list the words in alphabetical order.
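The ordering rule above can be sketched as a two-key sort. This is only an illustration: `rank_words` is a hypothetical helper name, and "alphabetical order" is assumed to mean plain `std::string` comparison (character-code order, where uppercase sorts before lowercase), which the assignment text does not spell out.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Copy the counts into a vector and sort it:
// higher frequency first, ties broken by ascending string order.
std::vector<std::pair<std::string, int>> rank_words(
        const std::unordered_map<std::string, int>& counts) {
    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;  // assumed tie-break: character-code order
              });
    return ranked;
}
```

Each element of the result can then be written out as one `<word>: number` line.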
Requirements:
1) Simple mode. Output simple word frequency.
Myapp.exe <directory-name>
Will output <your-name>.txt file in current directory, the text file contains word ranking list.
2) Extended mode.
This only applies to some special cases of words. If 2 words are different only in the ending numbers, we think they are the same word. For example, we consider “win”, “win95” and “win7” are ONE WORD; “Office” and “Office15” are the same; “iPhone4” and “Iphone5” are the same word. “win” and “win32a” are DIFFERENT words, as the difference is more than just ending numbers. “21century” and “century” are DIFFERENT words too.
When running with “-e” command line parameter,
Myapp.exe -e <directory-name>
The app will output <your-name>.txt file in current directory, the text file contains word ranking list, but the frequency is calculated based on the extended mode definition.
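One possible way to implement the extended-mode grouping is to map every word to a key with its trailing digits removed, and count per key. The sketch below rests on several assumptions: the key is also lower-cased (inferred from the “iPhone4” / “Iphone5” example), the spelling printed for a group is the one that sorts first in character-code order (the convention our class agreed on, so Liz23 wins over liz12), and the names `extended_key`, `Group` and `add_extended` are invented for this illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical helper: strip trailing digits, then lower-case what is left.
// "win95" -> "win", "Iphone5" -> "iphone", "win32a" -> "win32a" (ends in a letter).
std::string extended_key(const std::string& word) {
    std::size_t end = word.size();
    while (end > 0 && std::isdigit(static_cast<unsigned char>(word[end - 1]))) {
        --end;
    }
    std::string key = word.substr(0, end);
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return key;
}

struct Group {
    std::string display;  // spelling to print for this group
    int count = 0;        // total occurrences of all variants
};

// Add one occurrence of `word` to its group, keeping the variant that
// sorts first (assumed class convention: 'L' comes before 'l').
void add_extended(std::unordered_map<std::string, Group>& groups,
                  const std::string& word) {
    Group& g = groups[extended_key(word)];
    if (g.count == 0 || word < g.display) {
        g.display = word;
    }
    ++g.count;
}
```

With this scheme, “win”, “win95” and “win7” land in one group, while “win32a” and “21century” keep keys of their own because their differences are not confined to trailing digits.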