A Program That Verifies Entropy Increase Through Text Statistics

Table of Contents

  • Introduction
  • The Mathematical Definition of Entropy
  • Verification Method
  • Conclusions
  • Appendix: Source Code

Introduction

The law of entropy increase is a well-known consequence of the second law of thermodynamics: the entropy of any isolated system, including the universe itself, never decreases. This article puts the idea to an empirical test by measuring the entropy of articles stored as TXT files.

The Mathematical Definition of Entropy

Given a discrete random variable X with possible values x_1, x_2, \ldots, x_n, occurring with probabilities P(x_1), P(x_2), \ldots, P(x_n), the information entropy of X is defined as

H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)

This formula and its many variants are defined in plenty of places, but what does it actually express? This article explains it through concrete examples.
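As a quick worked illustration of the definition (a minimal sketch added here; the class and method names are mine, not from the original program): a fair coin, with both outcomes at probability 0.5, has exactly 1 bit of entropy, while a biased coin is more predictable and scores lower.

```java
public class EntropyExample {
	// Entropy of a discrete distribution given as a probability array, in bits.
	public static double entropy(double[] p) {
		double h = 0;
		for (double pi : p)
			if (pi > 0)
				h += -pi * Math.log(pi) / Math.log(2);
		return h;
	}

	public static void main(String[] args) {
		// A fair coin: maximum uncertainty over two outcomes -> exactly 1 bit.
		System.out.println(entropy(new double[] {0.5, 0.5}));
		// A biased coin: more predictable, so the entropy is lower.
		System.out.println(entropy(new double[] {0.9, 0.1}));
	}
}
```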

Verification Method

Following the formula, we count every character in a text file and compute the entropy of the resulting character distribution. In principle, the more ordered the content, the lower its entropy should be; the more disordered, the higher. I ran the statistics over an assortment of text files (source code at the end), with the results shown below:

----------- c:\Data\102个无聊网站.txt ------------
Length: 9.02 KB
Entropy = 4.7158

----------- c:\Data\18条忠告.txt ------------
Length: 5.41 KB
Entropy = 2.7741

----------- c:\Data\7件事说明成功.txt ------------
Length: 8.13 KB
Entropy = 3.0377

----------- c:\Data\C#开发DirectX.txt ------------
Length: 24.81 KB
Entropy = 4.9619

----------- c:\Data\entities1.txt ------------
Length: 20.92 KB
Entropy = 5.2982

----------- c:\Data\hlm.txt ------------
Length: 2464.21 KB
Entropy = 8.6316

----------- c:\Data\hlm1.txt ------------
Length: 7392.63 KB
Entropy = 8.6316

----------- c:\Data\input.txt ------------
Length: 19399.94 KB
Entropy = 7.7636

----------- c:\Data\List.txt ------------
Length: 1.07 KB
Entropy = 5.0984

----------- c:\Data\MOP顶极宅男恋爱史.txt ------------
Length: 58.16 KB
Entropy = 3.0086

----------- c:\Data\output.txt ------------
Length: 118.73 KB
Entropy = 4.0673

----------- c:\Data\res.txt ------------
Length: 3.93 KB
Entropy = 3.1446

----------- c:\Data\sample.txt ------------
Length: 196.25 KB
Entropy = 7.7117

----------- c:\Data\unicode.txt ------------
Length: 0.87 KB
Entropy = 6.7460

----------- c:\Data\不以物喜,不以己悲.txt ------------
Length: 3.50 KB
Entropy = 2.6829

----------- c:\Data\主导与影响世界的100个管理定律.txt ------------
Length: 5.77 KB
Entropy = 3.2032

----------- c:\Data\人人都需要掌握的18个世故人情.txt ------------
Length: 3.36 KB
Entropy = 2.5972

----------- c:\Data\以后的发展.txt ------------
Length: 2.70 KB
Entropy = 3.2418

----------- c:\Data\伪丢手机事件.txt ------------
Length: 0.67 KB
Entropy = 2.4646

----------- c:\Data\共享软件如何进军海外 .txt ------------
Length: 6.72 KB
Entropy = 3.6789

----------- c:\Data\决定一生的99个简单法则 .txt ------------
Length: 3.74 KB
Entropy = 2.8412

----------- c:\Data\十不要等.txt ------------
Length: 1.61 KB
Entropy = 2.4100

----------- c:\Data\单词2100强记忆法.txt ------------
Length: 43.98 KB
Entropy = 2.8009

----------- c:\Data\厚黑学.TXT ------------
Length: 393.48 KB
Entropy = 2.5912

----------- c:\Data\四级六和学位的重要性.txt ------------
Length: 3.78 KB
Entropy = 3.2485

----------- c:\Data\女人和短信.txt ------------
Length: 15.61 KB
Entropy = 3.7250

----------- c:\Data\小沈阳 废话集!!! 108条.txt ------------
Length: 6.74 KB
Entropy = 3.1106

----------- c:\Data\教你十条成为成熟男人的黄金法则.txt ------------
Length: 1.95 KB
Entropy = 2.4999

----------- c:\Data\新建 文本文档.txt ------------
Length: 3.42 KB
Entropy = 2.7682

----------- c:\Data\流氓就是流氓,不要拿“爱国”来说事!.txt ------------
Length: 2.14 KB
Entropy = 2.7445

----------- c:\Data\爱迪生的故事.txt ------------
Length: 9.18 KB
Entropy = 2.4569

----------- c:\Data\男人100条.txt ------------
Length: 10.47 KB
Entropy = 2.8571

----------- c:\Data\盖茨十大建议.txt ------------
Length: 1.34 KB
Entropy = 2.6005

----------- c:\Data\社会生存72条法则(推荐).txt ------------
Length: 4.69 KB
Entropy = 2.5794

----------- c:\Data\红楼梦.txt ------------
Length: 2464.17 KB
Entropy = 8.6313

----------- c:\Data\羊和狮子的故事.txt ------------
Length: 2.49 KB
Entropy = 2.5704

----------- c:\Data\血型与性格.txt ------------
Length: 0.68 KB
Entropy = 3.1348

----------- c:\Data\送给那些不懂女人的男人    1234 下页 末页 [只看楼主] [阅读全部] .txt ------------
Length: 2.29 KB
Entropy = 2.3640

----------- c:\Data\遥远的距离.txt ------------
Length: 3.08 KB
Entropy = 2.7832

----------- c:\Data\险被车撞事件.txt ------------
Length: 0.33 KB
Entropy = 2.3530

From these results we can see an important pattern: the smaller the file, the more ordered it tends to be, and therefore the lower its entropy. Since all of these files are human-written articles, they follow grammatical rules and thus carry a certain amount of order. What happens if we generate content at random? To find out, I wrote a small Java program (see the appendix) that measures the entropy of randomly generated counts. The results are as follows:

entropy = 13.86378601494719
entropy = 13.863806257137101
entropy = 13.864030595858974
entropy = 13.863822726347344
entropy = 13.863931775438516
entropy = 13.863801186344219
entropy = 13.863702885807156
entropy = 13.863861376780758
entropy = 13.863823648506052
entropy = 13.863946588612857

As the output shows, the entropy of randomly generated content reaches about 13.86.
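This value is no accident. The random test spreads 1200*1024 increments uniformly over 15000 bins, so the counts come out nearly equal, and a perfectly uniform distribution over n outcomes attains the maximum possible entropy, log2(n), which for n = 15000 is about 13.8727. The measured 13.86 sits just below that ceiling. A short sketch confirming the bound (the class and method names here are mine, not from the original code):

```java
public class MaxEntropy {
	// Entropy of a uniform distribution over n outcomes, summed term by term.
	public static double uniformEntropy(int n) {
		double p = 1.0 / n;
		double h = 0;
		for (int i = 0; i < n; i++)
			h += -p * Math.log(p) / Math.log(2);
		return h;
	}

	public static void main(String[] args) {
		// Equals log2(15000), roughly 13.87 -- the ceiling the random test approaches.
		System.out.println(uniformEntropy(15000));
	}
}
```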

Conclusions

From this experiment we can draw the following conclusions:

  • The shorter the content, the lower its entropy tends to be, because short content is more likely to be ordered.
  • A written article has lower entropy than randomly generated text: random values are maximally disordered, so their entropy is the highest.

Appendix: Source Code

  1. C# code for computing the entropy of text files
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace ConsoleApp1
{
	class Program
	{
		static void Main(string[] args)
		{
			foreach (var file in Directory.GetFiles("c:\\Data"))
			{
				if (file.ToLower().EndsWith("txt"))
				{
					var info = new FileInfo(file);
					if (info.Length > 0)
					{
						var entropy = GetEntropy(file);
						Console.WriteLine($"----------- {file} ------------\nLength: {info.Length / 1024.0:0.00} KB\nEntropy = {entropy:0.0000}\n\n");
					}
				}
			}
		}

		/// <summary>
		/// Compute the entropy of an article stored in a text file.
		/// </summary>
		/// <param name="file">Path of the text file to open.</param>
		/// <param name="encoding">The encoding of the input file.</param>
		/// <returns>The character-level entropy of the file, in bits.</returns>
		public static double GetEntropy(string file, string encoding = "utf-8")
		{
			double entropy = 0;
			string text = File.ReadAllText(file, encoding == "utf-8" ? Encoding.UTF8 : Encoding.Default);
			if (string.IsNullOrEmpty(text))
				return entropy;

			// count the occurrence of each character
			int[] chs = new int[65536];
			foreach (var item in text)
				chs[(int)item]++;

			// compute the probability of each character
			double[] odds = new double[chs.Length];
			for (int i = 0; i < chs.Length; i++)
				odds[i] = 1.0 * chs[i] / text.Length;

			// sum -p * log2(p) over the characters that actually occur
			foreach (var odd in odds)
				if (odd > 0)
					entropy += -odd * Math.Log2(odd);

			return entropy;
		}

		private static void Analyze(string text, int[] chs)
		{
			Dictionary<char, double> dic = new Dictionary<char, double>();
			StringBuilder sb = new StringBuilder();
			sb.AppendLine("Total Length: " + text.Length);
			for (int i = 0; i < chs.Length; i++)
			{
				if (chs[i] > 0)
				{
					// 1.0 * forces floating-point division; the integer division
					// chs[i] / text.Length would truncate to 0 for every character
					double odd = 1.0 * chs[i] / text.Length;
					sb.AppendLine($"{i:00000}\t{(char)i}\t{chs[i]}\t{odd:0.0000000000}");
					dic.Add((char)i, odd);
				}
			}
			File.WriteAllText(@"C:\data\output.txt", sb.ToString());
		}
	}
}

  2. Java code for measuring the entropy of randomly generated data
import java.util.Random;

public class EntropyTest {

	public static void main(String[] args) {
		for (int i = 0; i < 10; i++)
			test();
	}

	static void test() {
		int runTimes = 1200 * 1024; // total number of random increments
		int[] data = new int[15000];
		Random rand = new Random();
		// scatter the increments uniformly at random across the bins
		for (int i = 0; i < runTimes; i++)
			data[rand.nextInt(data.length)]++;

		// System.out.println(Arrays.toString(data));

		// sum -p * log2(p) over the non-empty bins
		double e = 0;
		double log2 = Math.log(2);
		for (int i = 0; i < data.length; i++) {
			if (data[i] > 0) {
				double odd = 1.0 * data[i] / runTimes;
				e += -odd * Math.log(odd) / log2;
			}
		}

		System.out.println("entropy = " + e);
	}
}
