Starting with version 4.2, Lucene provides a document classification function. In this article, we will run document classification in both Lucene and Mahout on the same corpus and compare the results.
Lucene implements Naive Bayes and k-NN rule classifiers. Trunk, which will become Lucene 5, the next major release, additionally implements a boolean (2-class) perceptron classifier. We use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and the k-NN rule.
On the Mahout side, we will do document classification with Naive Bayes and Random Forest.
Overview of Lucene Document Classification
Lucene’s classifier for document classification is defined by the Classifier interface.
```java
public interface Classifier<T> {

  /**
   * Assign a class (with score) to the given text String
   * @param text a String containing text to be classified
   * @return a {@link ClassificationResult} holding assigned class of type <code>T</code> and score
   * @throws IOException If there is a low-level I/O error.
   */
  public ClassificationResult<T> assignClass(String text) throws IOException;

  /**
   * Train the classifier using the underlying Lucene index
   * @param atomicReader the reader to use to access the Lucene index
   * @param textFieldName the name of the field used to compare documents
   * @param classFieldName the name of the field containing the class assigned to documents
   * @param analyzer the analyzer used to tokenize / filter the unseen text
   * @param query the query to filter which documents use for training
   * @throws IOException If there is a low-level I/O error.
   */
  public void train(AtomicReader atomicReader, String textFieldName, String classFieldName,
      Analyzer analyzer, Query query) throws IOException;
}
```
Because Classifier uses an index as its training data, you need to open an IndexReader on a prepared index and pass it as the first argument of the train() method. The second argument is the name of the Lucene field containing the text that has been tokenized and indexed, and the third argument is the name of the Lucene field holding the document category. Likewise, an Analyzer is passed as the fourth argument and a Query as the fifth. The Analyzer is the one used to tokenize the unknown text to be classified (in my personal opinion this is a bit awkward, and it would be better passed to the assignClass() method described below). The Query narrows down the documents used for training; pass null if there is no need to do so. The train() method has two more overloads with different arguments, but I will skip them here.
After calling train() on the Classifier interface, call the assignClass() method with an unknown document as a String to obtain the classification result. Classifier is an interface that uses Java generics, and assignClass() returns a ClassificationResult parameterized with the type variable T.
```java
public class ClassificationResult<T> {

  private final T assignedClass;
  private final double score;

  /**
   * Constructor
   * @param assignedClass the class <code>T</code> assigned by a {@link Classifier}
   * @param score the score for the assignedClass as a <code>double</code>
   */
  public ClassificationResult(T assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }

  /**
   * retrieve the result class
   * @return a <code>T</code> representing an assigned class
   */
  public T getAssignedClass() {
    return assignedClass;
  }

  /**
   * retrieve the result score
   * @return a <code>double</code> representing a result score
   */
  public double getScore() {
    return score;
  }
}
```
Calling the getAssignedClass() method of ClassificationResult gives you the classification result as type T.
Note that Lucene’s classifier is unusual in that the train() method does little work while assignClass() does most of it. This is very different from other commonly used machine learning software. In the learning phase of typical machine learning software, a model file is created by learning the corpus according to the selected machine learning algorithm (this is where most of the time and effort is spent; as Mahout is based on Hadoop, it uses MapReduce to reduce the time required here). In the classification phase, an unknown document is then classified by referring to the previously created model file, which usually requires few resources.
As Lucene uses an index as its model file, the train() method, which is the learning phase, does almost nothing (learning completes as soon as the index is created). Lucene’s index, however, is optimized for high-speed keyword search and is not an ideal format for a document classification model. Therefore, document classification is done by searching the index in the assignClass() method, which is the classification phase. Contrary to commonly used machine learning software, Lucene’s classifier thus requires high computing power in the classification phase. For sites mainly focused on search, this function should still be appealing, since it enables document classification with no additional cost beyond creating the index.
Now, let’s quickly go through how the two implementing classes of the Classifier interface do document classification, and actually call them from a program.
Using Lucene SimpleNaiveBayesClassifier
SimpleNaiveBayesClassifier is the first implementing class of the Classifier interface. As the name indicates, it is a Naive Bayes classifier. Naive Bayes classification finds the class c that maximizes the conditional probability P(c|d), the probability of class c given document d. Applying Bayes’ theorem to P(c|d) shows that it suffices to find the c that maximizes P(c)P(d|c). Logarithms are usually taken to avoid underflow, and the assignClass() method of SimpleNaiveBayesClassifier repeats this calculation once per class to perform MLE (maximum likelihood estimation).
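To make the per-class log-probability calculation concrete, here is a minimal, self-contained sketch of multinomial Naive Bayes with Laplace smoothing. This is a hypothetical helper class for illustration only, not Lucene’s actual implementation; class and method names are my own.

```java
import java.util.*;

// Minimal multinomial Naive Bayes sketch: scores each class by
// log P(c) + sum over tokens of log P(w|c) (Laplace-smoothed) and
// returns the argmax -- the computation repeated once per class.
// Hypothetical illustration, not Lucene's SimpleNaiveBayesClassifier.
public class NaiveBayesSketch {

  // classDocs maps a class label to its training documents (token lists)
  public static String classify(Map<String, List<List<String>>> classDocs,
                                List<String> unknown) {
    // vocabulary size and total document count, for smoothing and priors
    Set<String> vocab = new HashSet<>();
    int totalDocs = 0;
    for (List<List<String>> docs : classDocs.values()) {
      totalDocs += docs.size();
      for (List<String> d : docs) vocab.addAll(d);
    }
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String, List<List<String>>> e : classDocs.entrySet()) {
      // term frequencies within this class
      Map<String, Integer> tf = new HashMap<>();
      int classTokens = 0;
      for (List<String> d : e.getValue()) {
        for (String w : d) {
          tf.merge(w, 1, Integer::sum);
          classTokens++;
        }
      }
      // log P(c) + sum of log P(w|c), computed in log space to avoid underflow
      double score = Math.log((double) e.getValue().size() / totalDocs);
      for (String w : unknown) {
        int count = tf.getOrDefault(w, 0);
        score += Math.log((count + 1.0) / (classTokens + vocab.size()));
      }
      if (score > bestScore) { bestScore = score; best = e.getKey(); }
    }
    return best;
  }
}
```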
Now let’s use SimpleNaiveBayesClassifier, but before that, we need to prepare training data in an index. Here we use the livedoor news corpus. Let’s add the livedoor news corpus to the index via Solr, using the following schema definition.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
  <fields>
    <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_ja" indexed="true" stored="true" multiValued="false"/>
    <field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
    <field name="date" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>url</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```
Note that the cat field holds the classification class, while the body field is the field to learn from. First, start Solr with the above schema.xml and add the livedoor news corpus. You can stop Solr as soon as you finish adding the corpus.
Next, we need a Java program that uses SimpleNaiveBayesClassifier. To keep things simple, we classify the very same documents we used for training. The program looks as follows.
```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public final class TestLuceneIndexClassifier {

  public static final String INDEX = "solr2/collection1/data/index";
  public static final String[] CATEGORIES = {
    "dokujo-tsushin",
    "it-life-hack",
    "kaden-channel",
    "livedoor-homme",
    "movie-enter",
    "peachy",
    "smax",
    "sports-watch",
    "topic-news"
  };
  private static int[][] counts;
  private static Map<String, Integer> catindex;

  public static void main(String[] args) throws Exception {
    init();
    final long startTime = System.currentTimeMillis();
    SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
    IndexReader reader = DirectoryReader.open(dir());
    AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);
    classifier.train(ar, "body", "cat", new JapaneseAnalyzer(Version.LUCENE_46));
    final int maxdoc = reader.maxDoc();
    for(int i = 0; i < maxdoc; i++){
      Document doc = ar.document(i);
      String correctAnswer = doc.get("cat");
      final int cai = idx(correctAnswer);
      ClassificationResult<BytesRef> result = classifier.assignClass(doc.get("body"));
      String classified = result.getAssignedClass().utf8ToString();
      final int cli = idx(classified);
      counts[cai][cli]++;
    }
    final long endTime = System.currentTimeMillis();
    final int elapse = (int)(endTime - startTime) / 1000;

    // print results
    int fc = 0, tc = 0;
    for(int i = 0; i < CATEGORIES.length; i++){
      for(int j = 0; j < CATEGORIES.length; j++){
        System.out.printf(" %3d ", counts[i][j]);
        if(i == j){
          tc += counts[i][j];
        }
        else{
          fc += counts[i][j];
        }
      }
      System.out.println();
    }
    float accrate = (float)tc / (float)(tc + fc);
    float errrate = (float)fc / (float)(tc + fc);
    System.out.printf("\n\n*** accuracy rate = %f, error rate = %f; time = %d (sec); %d docs\n",
        accrate, errrate, elapse, maxdoc);
    reader.close();
  }

  static Directory dir() throws IOException {
    return FSDirectory.open(new File(INDEX));
  }

  static void init(){
    counts = new int[CATEGORIES.length][CATEGORIES.length];
    catindex = new HashMap<String, Integer>();
    for(int i = 0; i < CATEGORIES.length; i++){
      catindex.put(CATEGORIES[i], i);
    }
  }

  static int idx(String cat){
    return catindex.get(cat);
  }
}
```
Here we specified JapaneseAnalyzer as the Analyzer (there is a slight difference from index creation, where we used JapaneseTokenizer and the related TokenFilters through Solr). The string array CATEGORIES hard-codes the document categories. Executing this program displays a confusion matrix like Mahout’s, with rows and columns in the same order as the hard-coded category array.
Executing this program displays the following.
```
 760    0    4   23   37   37    2    2    5
  40  656    7   44   25    4   90    1    3
  87   57  392  102   68   24  113    5   16
  40   15    6  391   33    8   16    2    0
  14    2    0    5  845    2    0    1    1
 134    2    2   26  107  549   19    3    0
  43   36   13   17   26   36  693    5    1
   6    0    0   23   35    0    1  829    6
  10    9    9   25   66    6    5   45  595

*** accuracy rate = 0.775078, error rate = 0.224922; time = 67 (sec); 7367 docs
```
The classification accuracy rate came to 77%.
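The accuracy and error rates printed above are simply the diagonal of the confusion matrix over the total count. As a small standalone sketch of that arithmetic (hypothetical helper class, mirroring the evaluation loop in the program):

```java
// Accuracy from a confusion matrix: diagonal entries are correct
// classifications, everything else is an error. Hypothetical helper
// matching the tc/fc tally in the test program.
public class ConfusionStats {
  public static double accuracy(int[][] counts) {
    int tc = 0, fc = 0;
    for (int i = 0; i < counts.length; i++) {
      for (int j = 0; j < counts[i].length; j++) {
        if (i == j) tc += counts[i][j];
        else fc += counts[i][j];
      }
    }
    return (double) tc / (tc + fc);
  }
}
```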
Using Lucene KNearestNeighborClassifier
Another implementing class of Classifier is KNearestNeighborClassifier. KNearestNeighborClassifier takes k, which must be at least 1, as a constructor argument when creating an instance. You can use exactly the same program as for SimpleNaiveBayesClassifier; all you need to do is replace the line that creates the SimpleNaiveBayesClassifier instance with KNearestNeighborClassifier.
As described before, the assignClass() method does all the work for KNearestNeighborClassifier as well, and one interesting point is that it uses Lucene’s MoreLikeThis. MoreLikeThis is a tool that turns a criterion document into a query and performs a search, so that you can find documents similar to it. KNearestNeighborClassifier uses MoreLikeThis to find the k documents most similar to the unknown document passed to the assignClass() method, then applies majority rule to those k documents to determine the category of the unknown document.
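The majority-rule step over the k retrieved documents can be sketched on its own. This is a hypothetical illustration of the vote, assuming the labels of the k nearest documents have already been fetched (e.g. from MoreLikeThis hits); it is not KNearestNeighborClassifier’s actual code.

```java
import java.util.*;

// Majority vote over the category labels of the k nearest documents.
// Hypothetical sketch of the final step of k-NN classification; the
// neighbor labels are assumed to be retrieved already.
public class MajorityVote {
  public static String vote(List<String> kNearestLabels) {
    Map<String, Integer> tally = new HashMap<>();
    String best = null;
    int bestCount = 0;
    for (String label : kNearestLabels) {
      int c = tally.merge(label, 1, Integer::sum);  // increment this label's count
      if (c > bestCount) { bestCount = c; best = label; }
    }
    return best;
  }
}
```

With k=1 this degenerates to returning the single most similar document’s category, which is why the choice of k can change the result noticeably, as seen below.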
Executing the same program with KNearestNeighborClassifier displays the following when k=1.
```
 724   14   28   22    6   30    8   18   20
 121  630   41   13    2    9   35    6   13
 165   28  582   10    5   16   26    7   25
 229   15   15  213    6   14    6    2   11
 134   37   15    8  603   12   19    7   35
 266   38   39   24   14  412   22    9   18
 810   16    1    3    2    3   32    1    2
 316   18   14   12    5    7    8  439   81
 362   17   29   10    1    7    7   16  321

*** accuracy rate = 0.536989, error rate = 0.463011; time = 13 (sec); 7367 docs
```
Now the accuracy rate is 53%. Moreover, with k=3 the accuracy rate goes down to 48%.
```
 652    5   78    3    7   40   13   38   34
 127  540   82   15    1   10   58   23   14
 169   34  553    3    7   16   38   15   29
 242   10   32  156   12   13   15   10   21
 136   30   21    9  592   11   19   15   37
 309   34   58    5   23  318   40   28   27
 810    8    3    1    0   10   37    1    0
 312    8   44    7    5    2   13  442   67
 362   11   45    5    6   10   16   34  281

*** accuracy rate = 0.484729, error rate = 0.515271; time = 9 (sec); 7367 docs
```
Document Classification by NLP4L and Mahout
If you want to use a Lucene index as input data for Mahout, a handy command is available. However, since our purpose is supervised document classification, we also need to output the field that specifies the class, in addition to the document vectors.
Tools that make this easy are MSDDumper and TermsDumper from NLP4L, which we developed. NLP4L stands for Natural Language Processing for Lucene and is a natural language processing tool set that treats Lucene indexes as corpora.
Depending on the settings, MSDDumper and TermsDumper select and extract important words from a Lucene field according to measures like tf*idf, and output them in a format that Mahout commands can easily read. Let’s use this function to select 2,000 important words from the body field of the index and run the Mahout classification.
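The idea of ranking terms by tf*idf to keep only important words can be sketched as follows. This is a hypothetical, self-contained illustration of the scoring, not MSDDumper’s actual code; the class name and ranking details (max tf*idf over documents, idf = log(N/df) + 1) are my own choices for the example.

```java
import java.util.*;

// Select the top-n terms of a token-list corpus by tf*idf.
// Hypothetical sketch of "important word" selection, not MSDDumper itself.
public class TfIdfSelector {
  // docs: each document is a list of tokens; terms are ranked by their
  // maximum tf*idf score over all documents.
  public static List<String> topTerms(List<List<String>> docs, int n) {
    int numDocs = docs.size();
    // document frequency of each term
    Map<String, Integer> df = new HashMap<>();
    for (List<String> d : docs) {
      for (String w : new HashSet<>(d)) df.merge(w, 1, Integer::sum);
    }
    // best (maximum) tf*idf score seen for each term
    Map<String, Double> best = new HashMap<>();
    for (List<String> d : docs) {
      Map<String, Integer> tf = new HashMap<>();
      for (String w : d) tf.merge(w, 1, Integer::sum);
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        double idf = Math.log((double) numDocs / df.get(e.getKey())) + 1.0;
        best.merge(e.getKey(), e.getValue() * idf, Math::max);
      }
    }
    List<String> terms = new ArrayList<>(best.keySet());
    terms.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
    return terms.subList(0, Math.min(n, terms.size()));
  }
}
```

A term that is frequent in one document but rare across the corpus scores high; a term that appears everywhere scores low, which is the usual rationale for tf*idf-based feature selection.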
Looking only at the result, Mahout Naive Bayes shows an accuracy rate of 96%.
```
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   :  7128  96.7689%
Incorrectly Classified Instances :   238   3.2311%
Total Classified Instances       :  7366
=======================================================
Confusion Matrix
-------------------------------------------------------
  a    b    c    d    e    f    g    h    i   <--Classified as
823    1    1    6   12   19    2    4    2  |  870  a = dokujo-tsushin
  1  848    2    1    0    1   11    4    2  |  870  b = it-life-hack
  5    6  830    1    1    0    3    1   17  |  864  c = kaden-channel
  2    6    6  486    3    1    6    0    0  |  510  d = livedoor-homme
  0    0    1    1  865    1    0    1    1  |  870  e = movie-enter
 31    3    6   12   14  762    6    4    4  |  842  f = peachy
  0    0    2    0    0    1  867    0    0  |  870  g = smax
  0    0    0    1    0    0    0  897    2  |  900  h = sports-watch
  2    4    1    1    0    0    0   12  750  |  770  i = topic-news
=======================================================
Statistics
-------------------------------------------------------
Kappa                                      0.955
Accuracy                                 96.7689%
Reliability                              87.0076%
Reliability (standard deviation)           0.307
```
Also, Mahout Random Forest shows an accuracy rate of 97%.