Job1:
Map:
input: (byte offset, one line of the document) # TextInputFormat; the document name comes from the input split
output: (word@document, 1)
Reducer:
output: ((word@document), n)
n = sum of the 1-values emitted for each key (word@document)
The implicit step: in the shuffle phase, all pairs with the same key (word@document) are routed to the same reducer.
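
A minimal Java sketch of Job 1, assuming the classic Hadoop MapReduce API; class and variable names are illustrative, and the document name is recovered from the FileSplit (TextInputFormat itself only supplies a byte offset as the key):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WordFreqPerDoc {

        public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text wordAtDoc = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // TextInputFormat keys by byte offset; the document name comes from the split.
                String doc = ((FileSplit) ctx.getInputSplit()).getPath().getName();
                StringTokenizer tok = new StringTokenizer(line.toString());
                while (tok.hasMoreTokens()) {
                    wordAtDoc.set(tok.nextToken().toLowerCase() + "@" + doc);
                    ctx.write(wordAtDoc, ONE);              // (word@document, 1)
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text wordAtDoc, Iterable<IntWritable> ones, Context ctx)
                    throws IOException, InterruptedException {
                // The shuffle has already grouped every (word@document, 1) pair here.
                int n = 0;
                for (IntWritable one : ones) {
                    n += one.get();
                }
                ctx.write(wordAtDoc, new IntWritable(n));   // (word@document, n)
            }
        }
    }
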
Job2:
Map:
1. input: ((word@document), n)
2. Re-key each record by document
3. output: (document, word=n)
Reducer:
output: ((word@document), n/N)
N = total number of words in the document = sum of n over all words of that document
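
A sketch of Job 2 in the same style, assuming Job 1's output is read back with KeyValueTextInputFormat so the mapper sees key = word@document and value = n (that input-format choice is an assumption, not stated above):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TermFreqPerDoc {

        public static class RekeyByDocMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text wordAtDoc, Text n, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = wordAtDoc.toString().split("@");            // [word, document]
                ctx.write(new Text(parts[1]), new Text(parts[0] + "=" + n)); // (document, word=n)
            }
        }

        public static class TermFreqReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text doc, Iterable<Text> wordCounts, Context ctx)
                    throws IOException, InterruptedException {
                // First pass: buffer the counts and accumulate N, the document's word total.
                Map<String, Integer> counts = new HashMap<>();
                int totalN = 0;
                for (Text wc : wordCounts) {
                    String[] parts = wc.toString().split("=");               // [word, n]
                    int n = Integer.parseInt(parts[1]);
                    counts.put(parts[0], n);
                    totalN += n;
                }
                // Second pass: emit (word@document, n/N), keeping the raw fraction for Job 3.
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    ctx.write(new Text(e.getKey() + "@" + doc),
                              new Text(e.getValue() + "/" + totalN));
                }
            }
        }
    }

The two-pass structure is needed because a reducer's Iterable can only be traversed once; buffering assumes each document's vocabulary fits in the reducer's heap.
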
Job3:
Map:
1. input: ((word@document), n/N)
2. Re-key each record by word, since we need to count the number of documents in which it occurs
3. output: (word, document=n/N)
Reducer:
output: ((word@document), d/D, n/N, tfidf)
D = total number of documents in the corpus, which can be passed in via the job configuration
d = number of documents in the corpus in which the word appears
TFIDF = n/N * log(D/d)
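
A sketch of Job 3, again assuming KeyValueTextInputFormat over Job 2's output; the configuration property name used for D is made up for illustration:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TfIdf {

        public static class RekeyByWordMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text wordAtDoc, Text nOverN, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = wordAtDoc.toString().split("@");                 // [word, document]
                ctx.write(new Text(parts[0]), new Text(parts[1] + "=" + nOverN)); // (word, document=n/N)
            }
        }

        public static class TfIdfReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text word, Iterable<Text> docFreqs, Context ctx)
                    throws IOException, InterruptedException {
                // D is set in the job configuration; the property name is illustrative.
                long totalDocs = ctx.getConfiguration().getLong("tfidf.total.docs", 1);
                // First pass: buffer per-document frequencies; d = number of documents seen.
                Map<String, String> perDoc = new HashMap<>();
                for (Text df : docFreqs) {
                    String[] parts = df.toString().split("=");                    // [document, "n/N"]
                    perDoc.put(parts[0], parts[1]);
                }
                int docsWithWord = perDoc.size();                                 // d
                // Second pass: emit (word@document, [d/D, n/N, tfidf]).
                for (Map.Entry<String, String> e : perDoc.entrySet()) {
                    String[] frac = e.getValue().split("/");                      // ["n", "N"]
                    double tf = Double.parseDouble(frac[0]) / Double.parseDouble(frac[1]);
                    double tfidf = tf * Math.log((double) totalDocs / docsWithWord);
                    ctx.write(new Text(word + "@" + e.getKey()),
                              new Text("[" + docsWithWord + "/" + totalDocs + ", "
                                       + e.getValue() + ", " + tfidf + "]"));
                }
            }
        }
    }

For example, with n = 3, N = 100, D = 10 and d = 2: n/N = 0.03 and log(10/2) ≈ 1.609 (natural log here; any base works if used consistently), so TFIDF ≈ 0.048.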