网络数据挖掘 L1-L3 Indexer&Search

title: 网络数据挖掘 L1-L3 Indexer&Search
date: 2017-03-28 15:39:09
categories: DataMining
mathjax: true
tags: [WebDataMining]

L1 Instruction

略略略

L2 Architecture and Spiders

网页、网站架构、搜索引擎等

image.png

homework: Spider

L3 Indexer and Search

Indexer

image.png

Tokenizer
- Tokenization
- Stopping (a, the, this, that, of, etc.Stop word list)
Stemmer词干器（Porter stemming algorithm）
- Root: fish
- Words from the root: fishing, fished, fisher
Lemmatization词形还原（BE for am, is, are, was, were, been）
Spell Checking or Correction
- Example: s1=“war”, s2=“peace”
- war>par>pear>peac>peace
- Operations are: type3, type1, type3,type1
- Edit distance=4
- image.png
Token Sequence：Sequence of (Modified token, Document ID) pairs.
sort：Sort by terms And then docID
Dictionary and Postings：频率和doc list
- image.png

Search

Boolean Model
- Implementation
  - Three basic operations: AND, OR, NOT
  - Construct a binary tree to represent query
  - Process the input lists bottom-up
- 优点：精确查询、实现简单
- 缺点：文档权重一致、用户需要很好的Boolean query
Ranked Retrieval Model
- Jaccard系数：存在Jaccard系数word权重相等的问题
- TF&IDF
  - Term frequency tf：term在document中出现越多，该文档和该词越相关
  - Document frequency df：term在越多document出现，term的信息量越少
  - Inverse document frequency idf： $idf(t)=log_{10}(N/df(t))$ ，N是document的总量
  - Tf-idf: $tf.idf(t, d) = tf(t, d) * idf(t)$
  - Final Rank: $R(Q,d)=\sum_{t\in Q}tf.idf(t,d)$
Vector Space Model向量空间模型VSM
- 把文档看作一系列词(Term)，每一个词(Term)都有一个权重(Term weight)，不同的词(Term)根据自己在文档中的权重来影响文档相关性的打分计算。
- 于是我们把所有此文档中词(term)的权重(term weight) 看作一个向量。
  - Document = {term1, term2, …… ,term N}
  - 计算R(Q,d)：Document Vector = {weight1, weight2, …… ,weight N}
- 同样我们把查询语句看作一个简单的文档，也用向量来表示。
  - Query = {term1, term 2, …… , term N}
  - 计算R(Q,d)：Query Vector = {weight1, weight2, …… , weight N}
- 我们把所有搜索出的文档向量及查询向量放到一个N维空间中，每个词(term)是一维。
  
  Vector space model.jpg
- 两个向量之间的夹角越小，相关性越大。所以我们计算夹角的余弦值作为相关性的打分，夹角越小，余弦值越大，打分越高，相关性越大。
- 正则化并计算cos：
  
  image.png
Probabilistic Model
Evaluation
- Precision：The fraction of retrieved documents that are relevant
- Recall：The fraction of relevant documents that are retrieved
- image.png

homework

//倒排文件索引(Inverted File Index)
import java.io.*;
import java.util.*;

public class Main{
    public static void main(String[] args){
        Main test = new Main();
        test.MyInvertedFileBuilde();
    }
    private void MyInvertedFileBuilde(){
        File dir = new File("documents");
        File[] files = dir.listFiles();
        LinkedHashMap src = new LinkedHashMap<>();
        LinkedHashMap> invertedFile = new LinkedHashMap<>();
        for (int id = 0; id < files.length; id++) {
            File file = files[id];
            String lineTxt = new String();
            if(file.isFile()){
                try {
                    InputStreamReader read = new InputStreamReader(new FileInputStream(file),"GBK");
                    BufferedReader bufferedReader = new BufferedReader(read);
                    String tmp = null;
                    while((tmp = bufferedReader.readLine()) != null){
                        lineTxt = lineTxt + tmp;
                    }
                    src.put(id, lineTxt);//files ID map
                    String[] words = lineTxt.split(" ");
                    for (String word : words) {
                        if(!invertedFile.containsKey(word)){
                            invertedFile.put(word,new LinkedHashSet());
                            invertedFile.get(word).add(id);
                        }else {
                            invertedFile.get(word).add(id); // words inveredFile
                        }
                        
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        for (String word : invertedFile.keySet()) {
            System.out.println(word+" "+invertedFile.get(word));
        }
    }
}

网络数据挖掘 L1-L3 Indexer&Search

L1 Instruction

L2 Architecture and Spiders

L3 Indexer and Search

Indexer

Search

你可能感兴趣的:(网络数据挖掘 L1-L3 Indexer&Search)