【HDU December 2015 School Contest, Problem H】【Simulation STL-MAP STL-SET stringstream】Study Words: extract the 10 most frequent not-yet-learned words from an article


Study Words

Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K (Java/Others)
Total Submission(s): 226    Accepted Submission(s): 80


Problem Description
Learning English is not easy, vocabulary troubles me a lot.
One day an idea came up to me: I download an article every day, choose the 10 most popular new words to study.
A word's popularity is calculated by the number of its occurrences.
Sometimes two or more words have the same number of occurrences, and then the lexicographically smaller word has the higher popularity.
 

Input
T in the first line is case number.
Each case has two parts.
<oldwords>
...
</oldwords>
<article>
...
</article>
Between <oldwords> and </oldwords> are some old words (no more than 10000) I have already learned, that is, I don't need to learn them any more.
Words between <oldwords> and </oldwords> contain letters ('a'~'z','A'~'Z') only, separated by blank characters (' ','\n' or '\t').
Between <article> and </article> is an article (contains fewer than 1000000 characters).
Only continuous letters ('a'~'z','A'~'Z') make up a word. Thus words like "don't" are regarded as two words "don" and "t", that's OK.
Treat uppercase as lowercase, so "Thanks" equals "thanks". No word will be longer than 100.
As the article is downloaded from the internet, it may contain some Chinese words, which I don't need to study.
 

Output
For each case, output the top 10 new words I should study, one in a line.
If there are fewer than 10 new words, output all of them.
Output a blank line after each case.
 

Sample Input
   
   
   
   
2
<oldwords>
how aRe you
</oldwords>
<article>
--How old are you?
--Twenty.
</article>
<oldwords>
google cn huluobo net i
</oldwords>
<article>
文章内容:
I love google,dropbox,firefox very much.
Everyday I open my computer , open firefox , and enjoy surfing on the inter-
net. But these days it's strange that searching "huluobo" is unavail-
able. What's wrong with "huluobo"?
</article>
 

Sample Output
   
   
   
   
old
twenty

firefox
open
s
able
and
but
computer
days
dropbox
enjoy

 


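#include<stdio.h>
#include<iostream>
#include<sstream>
#include<algorithm>
#include<ctype.h>
#include<string.h>
#include<string>
#include<vector>
#include<set>
#include<map>
using namespace std;
int casenum,casei;
const int N=105;                    // no word is longer than 100 letters
char s[N];
char oldwords[]="</oldwords>";
char article[]="</article>";
set<string>sot;                     // old words, lowercased
map<string,int>mop;                 // new word -> number of occurrences
map<string,int>::iterator it;
const int L=1e6+100;char ss[L];     // article text with every non-letter turned into a space
vector<pair<int,string> >b;
int main()
{
    scanf("%d",&casenum);
    for(casei=1;casei<=casenum;++casei)
    {
        sot.clear();mop.clear();
        // Read tokens up to </oldwords>. The opening <oldwords> tag also lands in the set,
        // which is harmless because article words contain letters only.
        while(scanf("%104s",s)==1)
        {
            for(int i=0;s[i];++i)s[i]=tolower((unsigned char)s[i]);
            if(!strcmp(s,oldwords))break;
            sot.insert(s);
        }
        scanf("%104s",s);getchar();    // skip <article> and the line break after it
        int l=0;
        // fgets replaces gets, which was removed in C11 and C++14.
        while(fgets(ss+l,L-l,stdin))
        {
            int len=strlen(ss+l);
            // fgets keeps the line break, so strip it before comparing with the end tag
            while(len>0&&(ss[l+len-1]=='\n'||ss[l+len-1]=='\r'))ss[l+(--len)]=0;
            if(!strcmp(ss+l,article))break;
            for(int i=l;ss[i];++i)
            {
                // unsigned char: the bytes of Chinese characters are negative as plain char,
                // and passing a negative value to isalpha is undefined behavior
                unsigned char c=ss[i];
                ss[i]=isalpha(c)?tolower(c):' ';
            }
            l+=len;
            ss[l++]=' ';               // the line break separates words too
        }ss[l]=0;

        stringstream cinn(ss);
        string w;
        while(cinn>>w)
        {
            if(sot.find(w)==sot.end())++mop[w];
        }
        b.clear();
        for(it=mop.begin();it!=mop.end();++it)
        {
            // negated count: an ascending sort puts higher frequency first, then the smaller word
            b.push_back(make_pair(-it->second,it->first));
        }
        sort(b.begin(),b.end());
        for(int i=0;i<min(10,(int)b.size());++i)cout<<b[i].second<<endl;
        puts("");
    }
    return 0;
}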

/*
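【Tricks && rants】
1. I misread this problem as well = = If a line ends with an English letter, the line break has to be kept as a separator,
	otherwise that word gets glued to the first word of the next line. This dumb mistake cost me an hour and 4 penalty submissions during the contest...

2. Chinese characters can be recognized because each one shows up as two consecutive bytes with negative values (as signed char; this holds for GB2312-style encodings).
3. I was worried strcmp would be slow, so I hand-wrote a comparison function... (the version posted here still calls strcmp)
4. The data for this problem must be very weak: none of it tries to confuse the end-tag check with input like the following >_<
	<oldwords>
	</oldwords>
	<article>
	/article
	/article>
	</article
	</article>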

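【Problem summary】
Every test case has the following form:
<oldwords>
...
</oldwords>
<article>
...
</article>

From the words in the article, we have to find the 10 English words that occur most often.
Requirements:
1. The words in oldwords have already been learned and do not need to be studied again.
2. Matching is case-insensitive.
3. Chinese characters are ignored.
4. A word is a maximal run of consecutive letters, so even don't has to be split into the two words don and t.
5. When frequencies are equal, the lexicographically smaller word comes first.
6. If there are fewer than 10 words, output however many there are, ordered by the two keys (frequency, lexicographic order).
7. No word is longer than 100 characters.
8. The article is shorter than 1e6 characters.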

【Type】
Simulation STL-SET STL-MAP

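【Analysis】
My approach:

1. Convert everything to lowercase first.
2. Store every word that has to be excluded in a SET.
3. Extract all the words.
	There are convenient tricks for the implementation. For example, we can use:
	(1)stringstream cinn(s)
	(2)scanf(%[^])
4. Record the frequency of every word in a MAP.
5. Sort all the words in the MAP by (frequency, lexicographic order) and output the first 10.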

【Time complexity && optimization】
O(1e6 log(1e6))

Accepted in 0 ms, which shows the data really is weak = =

*/
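
The word-extraction step (trick 1 and analysis steps 1 and 3) can also be done in a single pass over the raw text, without first building a blanked-out copy for stringstream. Below is a minimal sketch of that idea, not the submitted solution; `extractWords` is a hypothetical helper name. It assumes the default "C" locale, in which `isalpha` accepts only ASCII letters.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text into lowercase words made of ASCII letters only.
// Every other byte (punctuation, digits, line breaks, bytes of Chinese characters)
// ends the current word.
std::vector<std::string> extractWords(const std::string& text) {
    std::vector<std::string> words;
    std::string cur;
    for (char ch : text) {
        // <cctype> functions need a non-negative value, so go through unsigned char
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isalpha(c)) {
            cur += static_cast<char>(std::tolower(c));
        } else if (!cur.empty()) {
            words.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) words.push_back(cur);  // the text may end in the middle of a word
    return words;
}
```

For example, `extractWords("It's on the inter-\nnet.")` yields it, s, on, the, inter, net: the apostrophe splits "It's", and the line break keeps "inter" and "net" apart, which is exactly the case described in trick 1.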


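
Step 5 of the analysis sorts every distinct word, although only the first 10 are printed. As a sketch of an alternative (not what the posted solution does), `std::partial_sort` can order just the top k entries under the same rule: higher frequency first, ties broken by the lexicographically smaller word. `topWords` and its parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: return up to k words, most frequent first,
// ties broken by the lexicographically smaller word.
std::vector<std::string> topWords(const std::map<std::string, int>& freq, std::size_t k) {
    std::vector<std::pair<std::string, int>> items(freq.begin(), freq.end());
    k = std::min(k, items.size());
    std::partial_sort(items.begin(), items.begin() + k, items.end(),
        [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
            if (a.second != b.second) return a.second > b.second;  // higher count first
            return a.first < b.first;                               // then the smaller word
        });
    std::vector<std::string> result;
    for (std::size_t i = 0; i < k; ++i) result.push_back(items[i].first);
    return result;
}
```

With the counts from the second sample case, `topWords(freq, 10)` starts with firefox, open, s (2 occurrences each) and continues with able, and, but, and so on. With n distinct words, `std::partial_sort` needs roughly n log k comparisons instead of n log n, though with data this small the difference is unlikely to matter.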