Computing LDA perplexity


The LDA program used here is JGibbLDA. Based on its output files, a senior labmate gave me the following function for computing perplexity.

import java.util.ArrayList; // goes at the top of the enclosing source file

/**
 * Computes the perplexity of a trained model from JGibbLDA output.
 *
 * @param tw_list rows of the topic-word matrix (.phi file), one row per topic
 * @param dt_list rows of the document-topic matrix (.theta file), one row per document
 * @param as_list rows of the .tassign file, one row per document
 */
public double compute(ArrayList<String[]> tw_list, ArrayList<String[]> dt_list, ArrayList<String[]> as_list) {

	double sum_ln_pt = 0.0; // log probability summed over all documents
	int sum_t = 0;          // total number of word tokens
	for (int i = 0; i < as_list.size(); i++) { // each row corresponds to one document

		String[] as_arr = as_list.get(i);
		double pt = 0.0; // accumulates the log probability of this document
		for (String as : as_arr) { // for each word token
			if (!as.isEmpty()) {
				sum_t++; // count only the tokens that contribute to the sum
				double pz = 0.0;
				String[] tz_arr = as.split(":");
				int t_id = Integer.parseInt(tz_arr[0]); // t_id is the word id
				//int z_id = Integer.parseInt(tz_arr[1]);
				for (int j = 0; j < tw_list.size(); j++) { // iterate over topics
					double p_tz = Double.parseDouble(tw_list.get(j)[t_id]); // topic j, word t_id
					double p_zd = Double.parseDouble(dt_list.get(i)[j]);    // document i, topic j
					pz += p_tz * p_zd;
				}
				pt += Math.log(pz);
			}
		}
		sum_ln_pt += pt;
	}
	// perplexity = exp(-(total log probability) / (total token count))
	return Math.exp(-sum_ln_pt / sum_t);
}
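
In other words, for every word token the function sums p(word | topic) * p(topic | document) over all topics, takes the log, adds these logs up over the whole corpus, and returns exp(-(sum of logs) / (number of tokens)). As a sanity check, here is a minimal sketch of the same computation on plain numeric arrays instead of parsed strings. The class name PerplexityToy and the toy numbers are made up for illustration and are not part of JGibbLDA.

```java
class PerplexityToy {
    // phi[k][w]  : p(word w | topic k)      (hypothetical toy values)
    // theta[d][k]: p(topic k | document d)  (hypothetical toy values)
    // docs[d]    : word ids of the tokens in document d
    static double perplexity(double[][] phi, double[][] theta, int[][] docs) {
        double sumLog = 0.0; // log probability summed over all tokens
        int tokens = 0;      // total number of tokens
        for (int d = 0; d < docs.length; d++) {
            for (int w : docs[d]) {
                double p = 0.0;
                for (int k = 0; k < phi.length; k++) {
                    p += phi[k][w] * theta[d][k]; // marginalize over topics
                }
                sumLog += Math.log(p);
                tokens++;
            }
        }
        return Math.exp(-sumLog / tokens);
    }

    public static void main(String[] args) {
        // One topic that is uniform over a 4-word vocabulary.
        double[][] phi = {{0.25, 0.25, 0.25, 0.25}};
        double[][] theta = {{1.0}};
        int[][] docs = {{0, 1, 2, 3}};
        System.out.println(perplexity(phi, theta, docs));
    }
}
```

With a single topic that is uniform over 4 words, every token has probability 0.25, so the result should be 4 up to floating-point rounding. More generally, a model that is uniform over V words has perplexity V, which makes this a handy check.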

Take the test file test_input.txt as an example (for the parameters, see the previous post, "JGibbLDA output explained with examples"):

4
sport Spanish football association competition club tickets scored win winners keeper shots best goal campaign season's Champions League 
France team France Football Federation president national team training session Champions record European competition without recording a single victory 
quit my job to travel passport world travel is a luxury for the privileged the rich or the retired travel stories Have a long-term plan visa-free destinations Central Station 
City of London dry gin drinking building older foundations River Fleet flavour gin and tonic be served with cubed ice fruit floral spicy earthy savoury citrus
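
To pass JGibbLDA's output for this input to the perplexity function, each output file has to be read into a list of string arrays, one array per line. The following is only a sketch of how that might be done. The class name LdaOutputLoader is made up, the file names in the comment (model-final.phi and so on) are an assumption about how the model was saved, and the files are assumed to be whitespace-separated.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class LdaOutputLoader {
    // Splits each line on whitespace. An empty line becomes an empty array
    // rather than being skipped, so row i still refers to document i.
    static ArrayList<String[]> splitRows(List<String> lines) {
        ArrayList<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            rows.add(trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+"));
        }
        return rows;
    }

    // Reads one output file (.phi, .theta or .tassign) into rows.
    static ArrayList<String[]> load(String path) throws IOException {
        return splitRows(Files.readAllLines(Paths.get(path)));
    }

    // Assumed usage (file names are hypothetical):
    //   ArrayList<String[]> tw = load("model-final.phi");
    //   ArrayList<String[]> dt = load("model-final.theta");
    //   ArrayList<String[]> as = load("model-final.tassign");
    //   double perplexity = compute(tw, dt, as);
}
```

Trimming before splitting avoids empty tokens caused by leading or trailing spaces on a line.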

Running with different numbers of topics gives different results, as follows:

Topic No. = 1 perplexity = 76.64751047832225

Topic No. = 2 perplexity = 47.257884731936784

Topic No. = 3 perplexity = 43.45364540615705

Topic No. = 4 perplexity = 48.62134551577707

Of the values tried, 3 topics gives the lowest perplexity, so it is the better choice for this input.
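
If many topic counts are being compared, picking the one with the lowest perplexity can be automated. A minimal sketch, using the four values reported above (the class name TopicCountSelector is made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TopicCountSelector {
    // Returns the topic count with the smallest perplexity, or -1 if the map is empty.
    static int bestTopicCount(Map<Integer, Double> perplexityByTopicCount) {
        int best = -1;
        double bestPerplexity = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, Double> e : perplexityByTopicCount.entrySet()) {
            if (e.getValue() < bestPerplexity) {
                bestPerplexity = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<Integer, Double> results = new LinkedHashMap<>();
        results.put(1, 76.64751047832225);
        results.put(2, 47.257884731936784);
        results.put(3, 43.45364540615705);
        results.put(4, 48.62134551577707);
        System.out.println(bestTopicCount(results)); // prints 3
    }
}
```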
