这道题只拿了24/35分,我猜问题出在stemming函数里,但是暂时不知道该怎么处理。
#include
#include
#include
#include
Plagiarism is a form of academic dishonesty. To fight with such misconducts, plagiarism checkers are developed to compare any submitted article with all the articles stored in a database. The key is that given two documents, how to judge their similarity, or in other words, document distance? The document distance can be measured with their word metrics. In this project, you are supposed to write a program to calculate the distance of any given pair of documents.
Some important terms are defined as follows:
Word
: is a continuous sequence of alphanumeric characters. For example, the phrase "the course data structure is fun" consists of 6 distinct words. Here we can assume that no word contains more than 20 characters.
Word frequency
: denoted by FD(w) is the number of times the word w occurs in a document D.
Document distance metric
: is defined to be the inner product of the two vectors containing the word frequencies for all words in two documents. Denote the frequency vector for document Di by F(Di)=(Fi(w1),⋯,Fi(wn))T, then the metric is given by (F(D1),F(D2))=∑i=1nF1(wi)F2(wi). In other words, the document distance metric is the projection of vector F(D1) onto F(D2), or vice versa.
Angle metric
: is the angle between the vectors F(D1) and F(D2) which gives an indication of overlap between the two documents D1 and D2. The angle can be computed by the following formula:
θ(D1,D2)=arccos(∣∣F(D1)∣∣⋅∣∣F(D2)∣∣(F(D1),F(D2)))
where θ∈[0,π/2] and ∣∣⋅∣∣ can be any norm of a vector. To be more specific, here we use the 2-norm which is defined by ∣∣F(D)∣∣=(F(D),F(D)).
Each input file contains one test case. For each case, the first line of input contains a positive integer N
(≤100) which is the number of text files to be processed. Then N
blocks follow, each contains a file. The first line of each file contains the title of the article (string of no more than 6 characters without any space), and the last line contains a single character #
. After the file blocks, there is a line containing a positive integer M
(≤100,000), followed by M
lines of inquiries. Each inquiry gives a pair of file names separated by a space. The size of the largest test case is about 1 MB.
For each inquiry, print in a line Case #: *
, where #
is the test case number starting from 1 and *
is the document distance value accurate up to 3 decimal places.
Note: Your stemming algorithm must be able to handle "es", "ed", "ing", "ies" and must be case-insensitive. Stop words are not supposed to be ignored and must be treated as normal words.
3
A00
A B C
#
A01
B C D
#
A02
A C
D A
#
2
A00 A01
A00 A02
Case 1: 0.841
Case 2: 0.785