Adding, Deleting, and Updating a Lucene Index (Incremental Indexing)

Reposted from: http://tonlo.com/space.php?uid=12933&do=blog&id=8669

I have been learning Lucene for a while now. Because of my company's environment, I am not using the Java version but the C# port, Lucene.Net.

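Since this is an enterprise application, the system has to support incremental indexing when it goes live. As anyone who has studied Lucene knows, Lucene's own support for incremental indexing leaves a lot to be desired, so the industry has come up with solutions such as Compass, a framework built on Lucene that, among other things, tackles the incremental indexing problem. Frustratingly, Compass is a Java framework and has no .NET counterpart. I suspect most people using the .NET version of Lucene have felt this pain at some point and ended up stuck.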

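To deal with this, some skilled people online have come up with algorithms that handle just the add, delete, and update operations on an index.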

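While looking for a solution online, I came across a CSDN blog (yfw418's column). Here is the link:

http://blog.csdn.net/yfw418/archive/2007/07/08/1682913.aspx

It describes a simple algorithm that implements add, delete, and update using only functions that ship with Lucene. Of course, it cannot compete with a dedicated open-source framework.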

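So I wrote an implementation of this algorithm. It has not gone live yet, so I cannot provide performance figures, but it definitely has a performance problem: every lookup scans all records linearly. With N database rows and M indexed documents, the comparison costs O(N*M), which is O(N*N) when the two are about the same size.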
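One way to avoid the repeated scans, sketched here under the assumption that the keys and dates of both sides fit in memory: read each side once into a dictionary, then classify every key with hash lookups, which takes O(N+M) rather than O(N*M). The names `IndexDiff`, `Plan`, and `Compare` are hypothetical and chosen for illustration; they are not part of Lucene.Net or of the algorithm from the linked blog.

```csharp
using System;
using System.Collections.Generic;

public static class IndexDiff
{
    // Result of comparing database rows with indexed documents.
    public sealed class Plan
    {
        public readonly List<string> ToAdd = new List<string>();
        public readonly List<string> ToUpdate = new List<string>();
        public readonly List<string> ToDelete = new List<string>();
    }

    // dbDates:    record ID -> modification time string read from the database
    // indexDates: Key field -> Date field, read once from the index
    public static Plan Compare(IDictionary<string, string> dbDates,
                               IDictionary<string, string> indexDates)
    {
        Plan plan = new Plan();

        // One pass over the database side: each lookup is a hash probe.
        foreach (KeyValuePair<string, string> row in dbDates)
        {
            string indexedDate;
            if (!indexDates.TryGetValue(row.Key, out indexedDate))
            {
                plan.ToAdd.Add(row.Key);       // in the database, not in the index
            }
            else if (!string.Equals(indexedDate, row.Value, StringComparison.Ordinal))
            {
                plan.ToUpdate.Add(row.Key);    // timestamps differ
            }
        }

        // One pass over the index side to find rows that no longer exist.
        foreach (string key in indexDates.Keys)
        {
            if (!dbDates.ContainsKey(key))
            {
                plan.ToDelete.Add(key);        // in the index, not in the database
            }
        }
        return plan;
    }
}
```

The three lists could then drive the same delete-then-add steps as the linear-scan version: delete every key in `ToDelete` and `ToUpdate`, then add documents for every key in `ToAdd` and `ToUpdate`.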

 

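Since the project indexes data that lives in a database, I added a time field that records when each record was added or last modified.

For the key, I use the ID of each database record, which guarantees that the Key field of every indexed document is unique.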
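Because a modification is detected by comparing the stored time with the database value as text, the string form of the time has to be the same on every run. A default `ToString()` on a date depends on the machine's culture settings, so a fixed format is safer. The following is a minimal sketch, assuming the time column is available as a `DateTime`; `IndexTimestamp.Format` is a hypothetical helper, not something from the original code or from Lucene.Net.

```csharp
using System;
using System.Globalization;

public static class IndexTimestamp
{
    // Hypothetical helper: formats a modification time for the time field.
    // The fixed pattern and invariant culture keep the text identical
    // regardless of the regional settings of the machine.
    public static string Format(DateTime modified)
    {
        return modified.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture);
    }
}
```

For example, `IndexTimestamp.Format(new DateTime(2008, 3, 5, 14, 7, 9))` yields `20080305140709`. The pattern only keeps whole seconds, so two modifications within the same second would not be told apart; a finer pattern would be needed if that matters.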

 

The (admittedly ugly) code follows:

        /// <summary>
        /// Performs incremental indexing: brings the index in line with the database.
        /// </summary>
        /// <returns>Nothing</returns>
        public static void IncreaseIndex()
        {
            System.Console.Out.WriteLine("Increase Index");
            try
            {
                ArrayList deleTerm = new ArrayList();     // Terms queued for deletion
                ArrayList addDoc = new ArrayList();       // documents queued for adding
                ArrayList DataRow = new ArrayList();      // one entry per database row
                SetSqlReader();
                // Open only the reader here. The IndexWriter is created after the reader
                // is closed: deleting through the reader needs the index write lock, which
                // an open IndexWriter would already be holding.
                IndexReader indexReader = IndexReader.Open(INDEX_DIR);
                String CurrentTime = System.DateTime.Now.ToString();

                System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
                sw.Start();

                // Copy the rows into an ArrayList so they can be located and re-read
                while (reader.Read())
                {
                    ArrayList DataField = new ArrayList();    // field values of one row
                    DataField.Add(reader["CName"].ToString());
                    DataField.Add(reader["Age"].ToString());
                    DataField.Add(reader["Date"].ToString());
                    DataField.Add(reader["ID"].ToString());

                    DataRow.Add(DataField);
                }

                // Needs improvement: a sequential scan is slow
                // Walk the database rows and check whether each one is already indexed
                int rows = 0;
                while (rows < DataRow.Count)
                {
                    // A row that is not indexed yet goes into the pending-add queue
                    if (!IsInIndex(((ArrayList)DataRow[rows])[3].ToString(), indexReader))
                    {
                        System.Console.Out.WriteLine("Adding to index");
                        // Queue the new document for indexing
                        Document newDoc = new Document();
                        newDoc.Add(new Field("CName", ((ArrayList)DataRow[rows])[0].ToString(), Field.Store.YES, Field.Index.TOKENIZED));
                        newDoc.Add(new Field("Age", ((ArrayList)DataRow[rows])[1].ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                        newDoc.Add(new Field("Date", ((ArrayList)DataRow[rows])[2].ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                        newDoc.Add(new Field("Key", ((ArrayList)DataRow[rows])[3].ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                        addDoc.Add(newDoc);

                        System.Console.Out.WriteLine("Document to add: " + ((ArrayList)DataRow[rows])[0].ToString() + "," + ((ArrayList)DataRow[rows])[1].ToString()
                                                           + "," + ((ArrayList)DataRow[rows])[2].ToString() + "," + ((ArrayList)DataRow[rows])[3].ToString() );
                    }

                    rows++;
                }

                // Needs improvement: a sequential scan is slow
                // Check whether indexed records were modified or deleted in the database.
                // Document numbers run up to MaxDoc() and can include deleted slots,
                // so iterate to MaxDoc() and skip the deleted ones.
                for (int i = 0; i < indexReader.MaxDoc(); i++)
                {
                    if (indexReader.IsDeleted(i))
                    {
                        continue;
                    }
                    Document doc = indexReader.Document(i);
                    String docKey = doc.Get("Key").ToString();
                    int position = -1;                          // position of the matching database row
                    // An indexed document whose Key is no longer in the database is deleted
                    if (-1 >= (position = IsInDatabase(docKey, DataRow)))
                    {
                        // Queue the old index entry for deletion
                        deleTerm.Add(new Term("Key", docKey));
                        System.Console.Out.WriteLine("Row deleted from the database; removing its index entry, Key: " + docKey);
                    }
                    else // the row still exists: check whether it was modified
                    {
                        String docDate = doc.Get("Date").ToString();

                        // If the time stored in the index differs from the database, re-index the row
                        if (!docDate.Equals(((ArrayList)DataRow[position])[2]))
                        {
                            System.Console.Out.WriteLine("Updating index");
                            // Queue the old index entry for deletion
                            deleTerm.Add(new Term("Key", docKey));

                            // Queue the modified document for indexing
                            Document newDoc = new Document();
                            newDoc.Add(new Field("CName", ((ArrayList)DataRow[position])[0].ToString(), Field.Store.YES, Field.Index.TOKENIZED));
                            newDoc.Add(new Field("Age", ((ArrayList)DataRow[position])[1].ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                            newDoc.Add(new Field("Date", ((ArrayList)DataRow[position])[2].ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                            newDoc.Add(new Field("Key", ((ArrayList)DataRow[position])[3].ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                            addDoc.Add(newDoc);

                            System.Console.Out.WriteLine("Document to update: " + ((ArrayList)DataRow[position])[0].ToString() + "," + ((ArrayList)DataRow[position])[1].ToString()
                                                           + "," + ((ArrayList)DataRow[position])[2].ToString() + "," + ((ArrayList)DataRow[position])[3].ToString() );
                        }
                    }
                }

                // Perform the deletions through the reader, then close it to release the lock
                for (int i = 0; i < deleTerm.Count; i++)
                {
                    Lucene.Net.Index.Term term = (Term)deleTerm[i];
                    indexReader.DeleteDocuments(term);
                }
                indexReader.Close();

                // Perform the additions; the writer is opened only now
                IndexWriter indexWriter = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), false);
                indexWriter.SetUseCompoundFile(false);
                for (int i = 0; i < addDoc.Count; i++)
                {
                    Document doc = (Document)addDoc[i];
                    indexWriter.AddDocument(doc);
                }
                indexWriter.Optimize();
                indexWriter.Close();
                sw.Stop();
                System.Console.Out.WriteLine("Incremental indexing took: " + sw.ElapsedMilliseconds.ToString() + " ms");
            }

            catch (Exception exp)
            {
                System.Console.WriteLine("Incremental indexing error: " + exp.Message);
                System.Console.ReadLine();
            }
            finally
            {
                reader.Close();
                connection.Close();
            }
        }

       

        /// <summary>
        /// Checks whether an indexed document still exists in the database.
        /// </summary>
        /// <param name="p_strdocKey">Key value of the document to look up</param>
        /// <param name="p_DataRow">ArrayList holding the rows read from the SqlDataReader</param>
        /// <returns>Position of the matching row in p_DataRow; a negative value means it does not exist</returns>
        public static int IsInDatabase(String p_strdocKey, ArrayList p_DataRow)
        {
            int position = 0;
            while (position < p_DataRow.Count)
            {
                // Each row holds four fields (CName, Age, Date, ID); the ID is at index 3
                if (p_strdocKey.Equals(((ArrayList)p_DataRow[position])[3].ToString()))
                {
                    return position;
                }
                position++;
            }
            return -1;
        }

        /// <summary>
        /// Checks whether a database record has already been indexed.
        /// </summary>
        /// <param name="p_strDataKey">Key field value of the database record</param>
        /// <param name="p_indexReader">The IndexReader to search</param>
        /// <returns>true if the record is indexed, false otherwise</returns>
        public static bool IsInIndex(String p_strDataKey, IndexReader p_indexReader)
        {
            // Document numbers run up to MaxDoc() and can include deleted slots
            for (int i = 0; i < p_indexReader.MaxDoc(); i++)
            {
                if (p_indexReader.IsDeleted(i))
                {
                    continue;
                }
                Document doc = p_indexReader.Document(i);
                if (p_strDataKey.Equals(doc.Get("Key")))
                {
                    return true;
                }
            }
            return false;
        }
