The law of entropy increase is a well-known law derived from the second law of thermodynamics; it states that the entropy of any system, including the universe itself, keeps increasing. In this article, we run measurements on TXT-format articles to examine entropy concretely.
Given a discrete random variable $X$ with possible values $x_1, x_2, \ldots, x_n$, where the values occur with probabilities $P(x_1), P(x_2), \ldots, P(x_n)$, the information entropy of $X$ is defined as:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)$$

This formula and its many variants are defined in plenty of places, but what does it actually express? This article explains it through concrete examples.
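To make the definition concrete, here is a small worked example (a made-up illustration, not one of the files measured below), taking the logarithm base 2 so that entropy is measured in bits. For the four-character string "aaab", $P(a) = 3/4$ and $P(b) = 1/4$, so

$$H = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811 \text{ bits}$$

while the string "abab" gives $P(a) = P(b) = 1/2$ and exactly $H = 1$ bit, the maximum for two symbols. The more skewed (that is, the more predictable) the character distribution, the lower the entropy.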
Following this formula, we count every character in a text file and compute its entropy. In principle, the more ordered the content, the lower its entropy should be, and the more disordered, the higher. The author ran the statistics over an assortment of text files (source code at the end); the results are as follows:
----------- c:\Data\102个无聊网站.txt ------------
Length: 9.02 KB
Entropy = 4.7158
----------- c:\Data\18条忠告.txt ------------
Length: 5.41 KB
Entropy = 2.7741
----------- c:\Data\7件事说明成功.txt ------------
Length: 8.13 KB
Entropy = 3.0377
----------- c:\Data\C#开发DirectX.txt ------------
Length: 24.81 KB
Entropy = 4.9619
----------- c:\Data\entities1.txt ------------
Length: 20.92 KB
Entropy = 5.2982
----------- c:\Data\hlm.txt ------------
Length: 2464.21 KB
Entropy = 8.6316
----------- c:\Data\hlm1.txt ------------
Length: 7392.63 KB
Entropy = 8.6316
----------- c:\Data\input.txt ------------
Length: 19399.94 KB
Entropy = 7.7636
----------- c:\Data\List.txt ------------
Length: 1.07 KB
Entropy = 5.0984
----------- c:\Data\MOP顶极宅男恋爱史.txt ------------
Length: 58.16 KB
Entropy = 3.0086
----------- c:\Data\output.txt ------------
Length: 118.73 KB
Entropy = 4.0673
----------- c:\Data\res.txt ------------
Length: 3.93 KB
Entropy = 3.1446
----------- c:\Data\sample.txt ------------
Length: 196.25 KB
Entropy = 7.7117
----------- c:\Data\unicode.txt ------------
Length: 0.87 KB
Entropy = 6.7460
----------- c:\Data\不以物喜,不以己悲.txt ------------
Length: 3.50 KB
Entropy = 2.6829
----------- c:\Data\主导与影响世界的100个管理定律.txt ------------
Length: 5.77 KB
Entropy = 3.2032
----------- c:\Data\人人都需要掌握的18个世故人情.txt ------------
Length: 3.36 KB
Entropy = 2.5972
----------- c:\Data\以后的发展.txt ------------
Length: 2.70 KB
Entropy = 3.2418
----------- c:\Data\伪丢手机事件.txt ------------
Length: 0.67 KB
Entropy = 2.4646
----------- c:\Data\共享软件如何进军海外 .txt ------------
Length: 6.72 KB
Entropy = 3.6789
----------- c:\Data\决定一生的99个简单法则 .txt ------------
Length: 3.74 KB
Entropy = 2.8412
----------- c:\Data\十不要等.txt ------------
Length: 1.61 KB
Entropy = 2.4100
----------- c:\Data\单词2100强记忆法.txt ------------
Length: 43.98 KB
Entropy = 2.8009
----------- c:\Data\厚黑学.TXT ------------
Length: 393.48 KB
Entropy = 2.5912
----------- c:\Data\四级六和学位的重要性.txt ------------
Length: 3.78 KB
Entropy = 3.2485
----------- c:\Data\女人和短信.txt ------------
Length: 15.61 KB
Entropy = 3.7250
----------- c:\Data\小沈阳 废话集!!! 108条.txt ------------
Length: 6.74 KB
Entropy = 3.1106
----------- c:\Data\教你十条成为成熟男人的黄金法则.txt ------------
Length: 1.95 KB
Entropy = 2.4999
----------- c:\Data\新建 文本文档.txt ------------
Length: 3.42 KB
Entropy = 2.7682
----------- c:\Data\流氓就是流氓,不要拿“爱国”来说事!.txt ------------
Length: 2.14 KB
Entropy = 2.7445
----------- c:\Data\爱迪生的故事.txt ------------
Length: 9.18 KB
Entropy = 2.4569
----------- c:\Data\男人100条.txt ------------
Length: 10.47 KB
Entropy = 2.8571
----------- c:\Data\盖茨十大建议.txt ------------
Length: 1.34 KB
Entropy = 2.6005
----------- c:\Data\社会生存72条法则(推荐).txt ------------
Length: 4.69 KB
Entropy = 2.5794
----------- c:\Data\红楼梦.txt ------------
Length: 2464.17 KB
Entropy = 8.6313
----------- c:\Data\羊和狮子的故事.txt ------------
Length: 2.49 KB
Entropy = 2.5704
----------- c:\Data\血型与性格.txt ------------
Length: 0.68 KB
Entropy = 3.1348
----------- c:\Data\送给那些不懂女人的男人 1234 下页 末页 [只看楼主] [阅读全部] .txt ------------
Length: 2.29 KB
Entropy = 2.3640
----------- c:\Data\遥远的距离.txt ------------
Length: 3.08 KB
Entropy = 2.7832
----------- c:\Data\险被车撞事件.txt ------------
Length: 0.33 KB
Entropy = 2.3530
One observation stands out from these results: the smaller the file, the more ordered it tends to be, and so the lower its entropy. Since these files are all human-written articles, they obey grammatical rules and therefore carry a certain amount of order. What happens if we generate content at random instead? The author wrote a random-sampling program in Java (see appendix) to measure the entropy of random values, with the following results:
entropy = 13.86378601494719
entropy = 13.863806257137101
entropy = 13.864030595858974
entropy = 13.863822726347344
entropy = 13.863931775438516
entropy = 13.863801186344219
entropy = 13.863702885807156
entropy = 13.863861376780758
entropy = 13.863823648506052
entropy = 13.863946588612857
As we can see, the entropy of the randomly generated content reaches about 13.86, and this is no accident.
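A short calculation shows that this is essentially the theoretical maximum. When all $n$ values are equally likely, $P(x_i) = 1/n$ for every $i$, and the definition reduces to

$$H_{\max} = -\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n} = \log_2 n$$

The Java program spreads its samples over $n = 15000$ bins, so $H_{\max} = \log_2 15000 \approx 13.8727$. The measured values of roughly 13.8638 sit just below this ceiling, exactly as one would expect from a finite sample of a uniform distribution.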
From this experiment we can draw the following conclusions: human-written articles are constrained by vocabulary and grammar and therefore have relatively low entropy; randomly generated content has no such structure, and its entropy approaches the theoretical maximum. In short, entropy is a usable, quantitative measure of how disordered a piece of content is.

Appendix: source code. First, the C# program used to compute the entropy of each text file:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            foreach (var file in Directory.GetFiles("c:\\Data"))
            {
                if (file.ToLower().EndsWith(".txt"))
                {
                    var info = new FileInfo(file);
                    if (info.Length > 0)
                    {
                        var entropy = GetEntropy(file);
                        Console.WriteLine($"----------- {file} ------------\nLength: {info.Length / 1024.0:0.00} KB\nEntropy = {entropy:0.0000}\n\n");
                    }
                }
            }
        }

        /// <summary>
        /// Compute the entropy of an article stored in a text file.
        /// </summary>
        /// <param name="file">Text file to be opened.</param>
        /// <param name="encoding">The encoding of the input file.</param>
        /// <returns>The Shannon entropy of the file's characters, in bits.</returns>
        public static double GetEntropy(string file, string encoding = "utf-8")
        {
            double entropy = 0;
            string text = File.ReadAllText(file, encoding == "utf-8" ? Encoding.UTF8 : Encoding.Default);
            if (string.IsNullOrEmpty(text))
                return entropy;
            // Count the occurrence of each character.
            int[] chs = new int[65536];
            foreach (var item in text)
                chs[(int)item]++;
            // Compute the probability of each character.
            double[] odds = new double[chs.Length];
            for (int i = 0; i < chs.Length; i++)
                odds[i] = 1.0 * chs[i] / text.Length;
            // Accumulate entropy over the characters that actually occur.
            foreach (var odd in odds)
                if (odd > 0)
                    entropy += -odd * Math.Log2(odd);
            return entropy;
        }

        // Helper (not called from Main): dumps per-character statistics to a file.
        private static void Analyze(string text, int[] chs)
        {
            Dictionary<char, double> dic = new Dictionary<char, double>();
            StringBuilder sb = new StringBuilder();
            sb.AppendLine("Total Length: " + text.Length);
            for (int i = 0; i < chs.Length; i++)
            {
                if (chs[i] > 0)
                {
                    double odd = 1.0 * chs[i] / text.Length;
                    sb.AppendLine($"{i:00000}\t{(char)i}\t{chs[i]}\t{odd:0.0000000000}");
                    dic.Add((char)i, odd); // was integer division (chs[i] / text.Length), which always yields 0
                }
            }
            File.WriteAllText(@"C:\data\output.txt", sb.ToString());
        }
    }
}
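One portability note: Math.Log2 exists only from .NET Core 3.0 onward; on older frameworks, Math.Log(odd, 2) or Math.Log(odd) / Math.Log(2) is the equivalent, which is exactly the trick the Java version uses below. Next, the Java program that measures the entropy of randomly generated data: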
import java.util.Random;

public class EntropyTest {
    public static void main(String[] args) {
        for (int i = 0; i < 10; i++)
            test();
    }

    static void test() {
        int runTimes = 1200 * 1024;  // number of random samples to draw
        int[] data = new int[15000]; // histogram over 15000 possible values
        Random rand = new Random();
        for (int i = 0; i < runTimes; i++)
            data[rand.nextInt(data.length)]++;
        // System.out.println(java.util.Arrays.toString(data)); // uncomment to dump the raw histogram

        // Shannon entropy in bits: Math.log is the natural log, so divide by ln(2).
        double e = 0;
        double log2 = Math.log(2);
        for (int i = 0; i < data.length; i++) {
            if (data[i] > 0) {
                double odd = 1.0 * data[i] / runTimes;
                e += -odd * Math.log(odd) / log2;
            }
        }
        System.out.println("entropy = " + e);
    }
}
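To reproduce the run, compile and execute the class with javac EntropyTest.java followed by java EntropyTest. Each of the ten iterations draws a fresh random sample, which is why the printed entropies vary slightly from run to run while all staying just below the log2(15000) bound derived above.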