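2.3 Core Indexing Process
2.3.1 The FileProp structure
Every file is passed through the file_properties function, which produces a FileProp structure holding the file's path, size, document type, and so on.
If the configuration file does not assign a document type, the type defaults to HTML. In our configuration file we set IndexContents TXT .txt, so files ending in .txt are treated as TXT.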
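The defaulting rule described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not Swish-e's own code: `DocType`, `ExtRule`, `has_suffix` and `doctype_for_path` are hypothetical names introduced here, and the only behaviour taken from the text is that an explicit extension mapping wins and HTML is used otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical document types; only HTML and TXT appear in the text. */
typedef enum { DOC_HTML, DOC_TXT } DocType;

/* Hypothetical rule: a path ending in `suffix` gets document type `type`. */
typedef struct {
    const char *suffix;
    DocType type;
} ExtRule;

/* Return 1 if `path` ends with `suffix`, 0 otherwise. */
static int has_suffix(const char *path, const char *suffix)
{
    size_t plen = strlen(path);
    size_t slen = strlen(suffix);
    return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/* Use the first matching rule; fall back to HTML when nothing matches. */
DocType doctype_for_path(const char *path, const ExtRule *rules, size_t nrules)
{
    size_t i;

    assert(path != NULL);
    for (i = 0; i < nrules; i++)
        if (has_suffix(path, rules[i].suffix))
            return rules[i].type;
    return DOC_HTML; /* default when the configuration says nothing */
}
```

With a single rule `{ ".txt", DOC_TXT }`, a path such as `notes.txt` resolves to `DOC_TXT`, while any path with an unlisted extension resolves to `DOC_HTML`.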
2.3.2 The do_index_file function
/***********************************************************************
 -- Start the real indexing process for a file.
 -- This routine will be called by the different indexing methods
 -- (httpd, filesystem, etc.)
 -- The indexed file may be the
 --   - real file on filesystem
 --   - tmpfile or work file (shadow of the real file)
 -- Checks if file has to be send thru filter (file stream)
 -- 2000-11-19 rasc
 ***********************************************************************/
void do_index_file(SWISH * sw, FileProp * fprop)
{
    /* Note the use of a function pointer here */
    int (*countwords)(SWISH *sw, FileProp *fprop, FileRec *fi, char *buffer);

    /* -- Read all data, last 1 is flag that we are expecting text only */
    rd_buffer = read_stream(sw, fprop, 1);

    /* just for fun so we can show total bytes shown */
    sw->indexlist->total_bytes += fprop->fsize;

    /* Set which parser to use */
    switch (fprop->doctype)
    {
    case TXT:
        strcpy(strType, "TXT");
        countwords = countwords_TXT;
        break;

    ----------------------------

    /* Now bump the file counter */
    idx->filenum++;
    indexf->header.totalfiles++;
    fi.filenum = idx->filenum;

    /** PARSE **/
    wordcount = countwords(sw, fprop, &fi, rd_buffer);
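do_index_file code excerpt (1)
do_index_file code analysis
Because the configuration file sets the document type to TXT, the function pointer points to countwords_TXT (in txt.c), which uses indexstring to break the document into words.
TXT is the simplest document type. The structure argument is IN_FILE, whose value is 1, meaning that the content is found only in the plain text of the file.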
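The function-pointer dispatch used by do_index_file can be shown in isolation. The sketch below is a toy, not Swish-e code: `CountWordsFn`, `count_txt`, `count_html` and `index_buffer` are hypothetical names, and the two "parsers" only count whitespace-separated words. What it keeps from the excerpt is the pattern: a switch on the document type picks a parser, and the parse step calls through the pointer without knowing which one was chosen.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DOC_HTML, DOC_TXT } DocType;

/* All parsers share one signature, so a single pointer can refer to any of them. */
typedef int (*CountWordsFn)(const char *buffer);

static int is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Toy TXT parser: counts whitespace-separated words. */
static int count_txt(const char *buffer)
{
    int count = 0, in_word = 0;

    for (; *buffer; buffer++) {
        int space = is_space(*buffer);
        if (!space && !in_word)
            count++;
        in_word = !space;
    }
    return count;
}

/* Toy HTML parser: same as count_txt, but skips text between '<' and '>'. */
static int count_html(const char *buffer)
{
    int count = 0, in_word = 0, in_tag = 0;

    for (; *buffer; buffer++) {
        char c = *buffer;
        if (c == '<') {
            in_tag = 1;
            in_word = 0;
        } else if (c == '>') {
            in_tag = 0;
        } else if (!in_tag) {
            int space = is_space(c);
            if (!space && !in_word)
                count++;
            in_word = !space;
        }
    }
    return count;
}

/* Select the parser by document type, then call it through the pointer. */
int index_buffer(DocType doctype, const char *buffer)
{
    CountWordsFn countwords;

    assert(buffer != NULL);
    switch (doctype) {
    case DOC_TXT:
        countwords = count_txt;
        break;
    case DOC_HTML:
    default:
        countwords = count_html;
        break;
    }
    return countwords(buffer);
}
```

The same input can give different counts depending on the selected parser: the string `"<b> x"` is two words for the TXT parser and one word for the HTML parser.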
CommonProperties are processed.

    metaID = 1;
    positionMeta = 1;    /* No metanames in TXT */

    return indexstring(sw, buffer, fi->filenum, IN_FILE, 1, &metaID, &positionMeta);

Because the format is TXT, metaID is set to 1 (no metanames).
countwords_TXT code excerpt
static void addword(char *word, SWISH * sw, int filenum, int structure,
                    int numMetaNames, int *metaID, int *word_position)
{
    int i;

    /* Add the word for each nested metaname. */
    for (i = 0; i < numMetaNames; i++)
        (void) addentry(sw, getentry(sw, word), filenum, structure, metaID[i], *word_position);

    (*word_position)++;
}
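To make the effect of the loop in addword concrete, here is a minimal sketch that records what would be added instead of touching a real index. `Posting`, `PostingLog`, `log_posting` and `add_word_sketch` are hypothetical stand-ins for addentry and the Swish-e structures. The point it illustrates is taken from the excerpt: one word yields one entry per active metaname, all at the same position, and the position advances once per word.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POSTINGS 16 /* arbitrary capacity for this sketch */

/* Hypothetical record of a single (word, file, metaname, position) addition. */
typedef struct {
    const char *word;
    int filenum;
    int metaID;
    int position;
} Posting;

typedef struct {
    Posting items[MAX_POSTINGS];
    size_t count;
} PostingLog;

/* Append one posting; it is dropped if the log is already full. */
static void log_posting(PostingLog *log, const char *word, int filenum,
                        int metaID, int position)
{
    if (log->count < MAX_POSTINGS) {
        Posting *p = &log->items[log->count++];
        p->word = word;
        p->filenum = filenum;
        p->metaID = metaID;
        p->position = position;
    }
}

/* Mirror of the addword loop: one posting per metaname, then advance once. */
void add_word_sketch(PostingLog *log, const char *word, int filenum,
                     int numMetaNames, const int *metaID, int *word_position)
{
    int i;

    assert(log != NULL && word != NULL && metaID != NULL && word_position != NULL);
    for (i = 0; i < numMetaNames; i++)
        log_posting(log, word, filenum, metaID[i], *word_position);

    (*word_position)++;
}
```

In the TXT case described above, numMetaNames is 1 and the only metaID is 1, so each word produces exactly one posting.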
2.3.3 getentry: the word-entry lookup function
ENTRY *getentry(SWISH * sw, char *word)
{
    IndexFILE *indexf = sw->indexlist;
    struct MOD_Index *idx = sw->Index;
    int hashval;
    ENTRY *e;

    if (!idx->entryArray)
    {
        idx->entryArray = (ENTRYARRAY *) emalloc(sizeof(ENTRYARRAY));
        idx->entryArray->numWords = 0;
        idx->entryArray->elist = NULL;
    }

    /* Compute hash value of word */
    hashval = verybighash(word);

    /* Look for the word in the hash array */
    for (e = idx->hashentries[hashval]; e; e = e->next)
        if (strcmp(e->word, word) == 0)
            break;

    /* flag hash entry used this file, so that the locations can be "compressed" in do_index_file */
    idx->hashentriesdirty[hashval] = 1;

    /* Word found, return it */
    if (e)
        return e;

    /* Word not found, so create a new word */
    e = (ENTRY *) Mem_ZoneAlloc(idx->entryZone, sizeof(ENTRY) + strlen(word));
    strcpy(e->word, word);
    e->next = idx->hashentries[hashval];
    idx->hashentries[hashval] = e;

    /* Init values */
    e->tfrequency = 0;
    e->u1.last_filenum = 0;
    e->currentlocation = NULL;
    e->currentChunkLocationList = NULL;
    e->allLocationList = NULL;

    idx->entryArray->numWords++;
    indexf->header.totalwords++;

    return e;
}
getentry code excerpt
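The excerpt above follows the classic "look up or insert" pattern on a hash table with chained buckets. A stripped-down sketch of that pattern, under stated assumptions: `Entry`, `WordTable`, `hash_word` and `get_entry_sketch` are hypothetical names; the hash is a simple djb2-style function standing in for verybighash; the table size is arbitrary; and plain malloc replaces Mem_ZoneAlloc, so entries are never freed here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1009 /* arbitrary prime, not Swish-e's value */

typedef struct Entry {
    struct Entry *next; /* next entry in the same bucket */
    int tfrequency;
    char word[];        /* C99 flexible array member holding the word */
} Entry;

typedef struct {
    Entry *buckets[TABLE_SIZE];
    size_t num_words;
} WordTable;

/* djb2-style string hash reduced to a bucket index (stand-in for verybighash). */
static unsigned hash_word(const char *word)
{
    unsigned long h = 5381;

    while (*word)
        h = h * 33 + (unsigned char) *word++;
    return (unsigned) (h % TABLE_SIZE);
}

/* Return the entry for `word`, creating it at the head of its bucket if absent. */
Entry *get_entry_sketch(WordTable *table, const char *word)
{
    unsigned h;
    Entry *e;

    assert(table != NULL && word != NULL);
    h = hash_word(word);

    /* Walk the bucket's chain looking for an exact match. */
    for (e = table->buckets[h]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e;

    /* Not found: allocate the entry and the word (plus its terminator) together. */
    e = malloc(sizeof(Entry) + strlen(word) + 1);
    if (e == NULL)
        return NULL;
    strcpy(e->word, word);
    e->tfrequency = 0;
    e->next = table->buckets[h];
    table->buckets[h] = e;
    table->num_words++;
    return e;
}
```

Calling `get_entry_sketch` twice with the same word returns the same pointer and increases `num_words` only once, which mirrors how getentry bumps its word counters only when a new entry is created.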
getentry code analysis