Information Retrieval Experiment
Question 1:
Inverted Index:
An IR system sorts the terms alphabetically (a-z) and gives every term an index (position) in the term list, which makes it possible to identify the documents that contain a given term. As a result, in Boolean IR, the system uses a term-document matrix in which each entry is located by the term's index and the address of the term, and reads off the binary value (1 or 0) that says whether a document contains the given term.
That is the basic idea of an inverted index.
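The idea above can be sketched in a few lines of Python. This is only a minimal illustration, assuming whitespace tokenization and lowercasing; the function name build_inverted_index and the sample documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: terms are lowercase, whitespace-separated tokens.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort the terms alphabetically (a-z), as in the term list described above.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical sample collection.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
print(index["july"])  # [2, 3]
print(index["home"])  # [1, 2, 3]
```

Looking up a term returns the documents that contain it, which is exactly the yes/no information that Boolean IR needs.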
Ranking Function (e.g. TF-IDF):
An advanced IR system usually weights a term according to its importance. So how do we weight a term, and how do we know whether it is important? This is where TF and IDF come in.
TF (Term Frequency) means how many times a term occurs in an individual document, and it is used to describe how important the term is for that document. The simplest method is the raw frequency:
TF(d, j) = Num(j).
However, the drawback is obvious: the longer a document is, the higher the frequency of a term tends to be. So we usually normalize by document length, using another simple formula:
TF(d, j) = Num(j) / Num(w).
Here Num(w) is the number of words in document d and Num(j) is the number of occurrences of term j in document d.
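As a small illustration of the length-normalized formula, the sketch below computes TF for every term of one document. The helper name term_frequency and the sample sentence are assumptions made for the example.

```python
from collections import Counter


def term_frequency(tokens):
    """Return the length-normalized frequency of each term in one document."""
    counts = Counter(tokens)   # Num(j) for every term j
    total = len(tokens)        # Num(w), the number of words in the document
    return {term: n / total for term, n in counts.items()}


# Hypothetical document, tokenized on whitespace.
tokens = "to be or not to be".split()
tf = term_frequency(tokens)
print(tf["to"])  # 2 occurrences out of 6 words
print(tf["or"])  # 1 occurrence out of 6 words
```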
IDF (Inverse Document Frequency) is based on how many documents in the collection contain the term, and it is used to assign an importance weight to words across the document collection. If a word occurs in almost every document of the collection, it cannot be very important. In other words, a word that occurs in only a few documents can be a better discriminator term. There is a simple but effective formula for calculating IDF:
IDF(j) = log(N / df(j)).
Here N is the number of documents in the collection and df(j) is the number of documents that contain term j.
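The IDF formula can be tried out on a tiny collection. This is a sketch under simple assumptions: the natural logarithm is used, documents are already tokenized, and the function name inverse_document_frequency is made up for the example.

```python
import math


def inverse_document_frequency(docs):
    """Return log(N / df) for every term in a list of tokenized documents."""
    n_docs = len(docs)
    df = {}
    for tokens in docs:
        # Count each term at most once per document.
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical collection of three short documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
idf = inverse_document_frequency(docs)
print(idf["the"])  # 0.0, because "the" occurs in every document
print(idf["dog"] > idf["cat"])  # True, "dog" is the rarer term
```

A word that appears in every document gets weight 0, while a rarer word gets a larger weight, which matches the discriminator argument above.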
Now we can combine TF and IDF to generate the weight of every unique term in a document:
W(d, j) = TF(d, j) * log(N / df(j)).
According to these TF-IDF scores, the documents are ranked in decreasing order of relevance and the results are presented to the user.
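Putting the pieces together, the following sketch ranks a small collection for a query. It assumes that a document's score is the sum of the TF-IDF weights of the query terms it contains, which is one common choice but is not stated in the text above; the function name rank and the documents are invented for the example.

```python
import math
from collections import Counter


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing TF-IDF score."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in counts:
                tf = counts[term] / len(tokens)
                # Assumption: sum the weights of all matching query terms.
                score += tf * math.log(n_docs / df[term])
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical collection.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "the dog barked",
}
ranking = rank("dog barked", docs)
print([doc_id for doc_id, _ in ranking])  # ['d3', 'd2', 'd1']
```

Document d3 comes first because it contains both query terms, and "barked" is rare in the collection; d1 contains neither term and gets a score of 0.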
Question 2:
What are stop words?
A word that occurs very often in every document is called a stop word. A stop word is therefore also a word that cannot serve as a discriminator term. To make the index smaller, we could remove such words by using a predefined stop word list. However, if we do this, it becomes difficult for a search engine to handle a query like "to be or not to be".
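The effect of a predefined stop word list, including the problem with the query above, can be seen in a short sketch. The list STOP_WORDS here is a tiny made-up sample, not a standard list, and the function name is an assumption.

```python
# Assumption: a very small, hand-picked stop word list for illustration.
STOP_WORDS = {"a", "an", "and", "be", "in", "not", "of", "or", "the", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of text that are not in the stop word list."""
    return [token for token in text.lower().split() if token not in stop_words]


print(remove_stop_words("the history of search engines"))
# ['history', 'search', 'engines']
print(remove_stop_words("to be or not to be"))
# [] (every word of the query is a stop word, so nothing is left to search for)
```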
Why is ranked retrieval better than Boolean retrieval?
Boolean retrieval tests whether a document contains the words in the query, so the answer for each document is always yes or no (1 or 0). This leads to some limitations:
1. Boolean retrieval cannot handle complex queries.
2. Boolean retrieval does not differentiate among terms.
3. It cannot tell which result is more relevant to the user's query.
For all of the reasons above, we cannot use it on the WWW. Ranked retrieval, however, solves all of these problems. It distinguishes the importance of every term, every image, and every link, and assigns each a score through a relevance module. TF-IDF, described above, is a good example of the advantage of ranked retrieval.
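The difference can be illustrated with a short sketch. Here the ranked version uses a plain count of query term occurrences as a stand-in score, which is a simplifying assumption and much cruder than TF-IDF; the function names and documents are invented for the example.

```python
def boolean_match(query, docs):
    """Return ids of documents containing every query term (yes/no only)."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms <= set(text.lower().split())]


def ranked_match(query, docs):
    """Order all documents by how often the query terms occur in them."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        # Assumption: raw occurrence count as a stand-in relevance score.
        scores[doc_id] = sum(tokens.count(term) for term in terms)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical collection.
docs = {
    "d1": "search engine basics",
    "d2": "search engine ranking makes a search engine useful",
    "d3": "cooking recipes",
}
print(boolean_match("search engine", docs))  # ['d1', 'd2']
print(ranked_match("search engine", docs))   # ['d2', 'd1', 'd3']
```

Boolean retrieval only says that d1 and d2 match, while the ranked version also says that d2 matches more strongly than d1.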
Question 3:
For Chinese users, there are two big search engines to choose from: Baidu and Google.
Mostly, I use Google for searching, usually for WWW pages.
There are three reasons why I choose Google for searching information.
1. Language support. As a university student, I often need English material. However, Baidu only supports Chinese search, while Google supports both Chinese and English.
2. Technical level. The Baidu search engine just divides documents into many classes and matches the user's queries against the documents. Google, in contrast, uses your browsing history to judge what kind of information you need and what type of document you want.
3. Results. For Baidu, whose market is mostly in China, politically sensitive content is usually hidden and cannot be found in the search results. In that respect, Google is totally different.
In the future, search will be everywhere. These are the features I would like to see:
1. Individual service. Everyone has their own database, and based on this database, everyone can get individual search results.
2. Artificial intelligence. The search engine fully understands what users want and can even communicate with them.