An algorithm for detecting UTF-8 encoding

Reference: http://www.cnblogs.com/powertoolsteam/archive/2010/09/20/1831638.html

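Problem: UTF-8 text comes with or without a BOM. The BOM case is easy to detect: if the first three bytes are 0xEF 0xBB 0xBF, the data is UTF-8 with a BOM. Without a BOM there is no quick way to tell UTF-8 from an ANSI encoding, because neither carries a header and text that is entirely English (ASCII) is byte-for-byte identical in both. That is the problem.

Solution:

A few days ago I came across a forum post asking "how do I automatically detect whether the Chinese parameters in a URL are encoded as GB2312 or UTF-8?"

I also read wcwtitxu's algorithm, which detects UTF-8 with a seriously impressive regular expression.

A regular expression with that many alternations (|) does not perform well, though.

I happened to have a similar requirement in a past project, so here are the approach and the cleaned-up source code for reference.

First, the principle: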

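The UTF-8 encoding rules are shown in the table below:

Code point range      Byte 1    Byte 2    Byte 3    Byte 4
U+0000  - U+007F      0xxxxxxx
U+0080  - U+07FF      110xxxxx  10xxxxxx
U+0800  - U+FFFF      1110xxxx  10xxxxxx  10xxxxxx
U+10000 - U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx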

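It looks complicated, but it boils down to this:

ASCII characters (U+0000 - U+007F) are stored unchanged, one byte each.

Everything else is encoded as follows:

•The first byte starts, in binary, with n 1-bits followed by a 0 (n >= 2). The bits after the 0 hold part of the actual code point, and n is the total number of bytes in the multi-byte sequence (including the first byte).
•It is followed by n-1 bytes that each start with 10; their low 6 bits hold the rest of the code point.

So by walking the whole byte stream and checking this structure, we can decide whether it is UTF-8.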
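
To make the bit layout concrete, here is a small sketch that prints the UTF-8 bytes of '中' (U+4E2D) in binary and counts the leading 1-bits of each byte. The class name Utf8BitsDemo and the helper LeadingOnes are made up for this illustration and are not part of the original code.

```csharp
using System;
using System.Text;

static class Utf8BitsDemo
{
    // Hypothetical helper: counts the 1-bits at the top of a byte
    // before the first 0-bit.
    public static int LeadingOnes(byte b)
    {
        int n = 0;
        for (int mask = 0x80; mask != 0 && (b & mask) != 0; mask >>= 1)
        {
            n++;
        }
        return n;
    }

    static void Main()
    {
        // "\u4E2D" is the character '中'; Encoding.UTF8 turns it into three bytes.
        byte[] bytes = Encoding.UTF8.GetBytes("\u4E2D");
        foreach (byte b in bytes)
        {
            string bits = Convert.ToString(b, 2).PadLeft(8, '0');
            Console.WriteLine("0x{0:X2} {1} leading ones: {2}", b, bits, LeadingOnes(b));
        }
    }
}
```

The first byte (0xE4, 11100100) has three leading 1-bits, so the sequence is three bytes long; the other two bytes (0xB8 and 0xAD) have a single leading 1-bit, which marks them as continuation bytes.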

Following this rule, here is my C# code:

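/// <summary>
///   Determines whether the given bytes are UTF8 encoded.
/// </summary>
/// <param name="inputStream">
///   The input stream.
/// </param>
/// <returns>
///   <c>true</c> if the given byte stream is in UTF8 encoding; otherwise, <c>false</c>.
/// </returns>
/// <remarks>
///   Input made up entirely of ASCII chars is regarded as not UTF8 encoded.
/// </remarks>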
public static bool IsTextUTF8(ref byte[] inputStream)
{
    // Number of continuation bytes still expected for the current character.
    int encodingBytesCount = 0;
    bool allTextsAreASCIIChars = true;

    for (int i = 0; i < inputStream.Length; i++)
    {
        byte current = inputStream[i];

        if ((current & 0x80) == 0x80)
        {
            allTextsAreASCIIChars = false;
        }

        if (encodingBytesCount == 0)
        {
            // First byte of a character.
            if ((current & 0x80) == 0)
            {
                // ASCII char, 0x00-0x7F.
                continue;
            }

            if ((current & 0xC0) == 0xC0)
            {
                // Lead byte 11xxxxxx: at least one continuation byte follows.
                encodingBytesCount = 1;
                current <<= 2;

                // Each further leading 1-bit means one more continuation byte.
                while ((current & 0x80) == 0x80)
                {
                    current <<= 1;
                    encodingBytesCount++;
                }
            }
            else
            {
                // Invalid bit structure for UTF8: 10xxxxxx cannot start a character.
                return false;
            }
        }
        else
        {
            // Continuation bytes must start with 10.
            if ((current & 0xC0) == 0x80)
            {
                encodingBytesCount--;
            }
            else
            {
                // Invalid bit structure for UTF8.
                return false;
            }
        }
    }

    if (encodingBytesCount != 0)
    {
        // The stream ended in the middle of a multi-byte sequence.
        return false;
    }

    // UTF8 also encodes ASCII chars, but an input stream whose contents are all ASCII is regarded as the default encoding.
    return !allTextsAreASCIIChars;
}
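
The hand-written check above only looks at the lead/continuation bit pattern, so it does not reject things like overlong forms or lead bytes that can never occur in UTF-8. As a cross-check, here is a minimal sketch that leans on the framework's own decoder instead, using the UTF8Encoding constructor with throwOnInvalidBytes set to true. The class name StrictUtf8Check and the method IsStrictlyValidUtf8 are made up for this illustration.

```csharp
using System;
using System.Text;

static class StrictUtf8Check
{
    // Hypothetical helper, not part of the original post.
    public static bool IsStrictlyValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting a replacement character for malformed input.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        byte[] good = Encoding.UTF8.GetBytes("\u4E2D"); // E4 B8 AD
        byte[] overlong = { 0xC0, 0x80 };                // overlong form of U+0000
        byte[] truncated = { 0xE4, 0xB8 };               // last continuation byte missing

        Console.WriteLine(IsStrictlyValidUtf8(good));
        Console.WriteLine(IsStrictlyValidUtf8(overlong));
        Console.WriteLine(IsStrictlyValidUtf8(truncated));
    }
}
```

Note that this sketch behaves differently from IsTextUTF8 in one respect: pure ASCII input (and empty input) decodes without error, so it counts as valid here, whereas IsTextUTF8 deliberately returns false for it.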

 

 

And here is the unit test code:

 

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

/// <summary>
/// This is a test class for EncodingHelperTest and is intended
/// to contain all EncodingHelperTest Unit Tests
/// </summary>
[TestClass()]
public class EncodingHelperTest
{
    /// <summary>
    ///   Normal test for this method.
    /// </summary>
    [TestMethod()]
    public void IsTextUTF8Test()
    {
        for (int i = 0; i < 1000; i++)
        {
            // Always start with a non-ASCII char so the string is never pure ASCII.
            List<char> chars = new List<char>();
            chars.Add('中');

            List<UnicodeCategory> temp = new List<UnicodeCategory>();
            Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));

            for (int j = 0; j < 255; j++)
            {
                char ch = (char)rd.Next(0xFFFF);
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
                if (uc == UnicodeCategory.Surrogate || // A single surrogate cannot be encoded correctly.
                    uc == UnicodeCategory.PrivateUse || // Private use blocks should be excluded.
                    uc == UnicodeCategory.OtherNotAssigned)
                {
                    // Rejected char: retry this slot.
                    j--;
                }
                else
                {
                    chars.Add(ch);
                    temp.Add(uc);
                }
            }

            string str = new string(chars.ToArray());

            byte[] inputStream = Encoding.UTF8.GetBytes(str);
            bool expected = true;
            bool actual;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));

            // Code page 932 is Shift-JIS. On .NET Core / .NET 5+ it is only available after
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called.
            inputStream = Encoding.GetEncoding(932).GetBytes(str);
            expected = false;

            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, string.Format("ShiftJIS_Assert Fails at:{0}", str));
        }
    }

    /// <summary>
    ///   Check with all ASCII chars.
    /// </summary>
    [TestMethod]
    public void IsTextUTF8Test_AllASCII()
    {
        string str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";

        byte[] inputStream = Encoding.UTF8.GetBytes(str);
        bool expected = false;
        bool actual;
        actual = EncodingHelper.IsTextUTF8(ref inputStream);
        Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    }
}

 

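Also:

If you only need to decide whether a file is UTF-8 encoded, you do not necessarily need this method: files saved as UTF-8 often begin with a three-byte BOM (0xEF 0xBB 0xBF) that marks the file as UTF-8.

References:

Wikipedia: http://en.wikipedia.org/wiki/UTF-8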
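
A BOM check of that kind could look roughly like the sketch below. The class name BomCheck and the methods HasUtf8Bom and FileHasUtf8Bom are made up for this illustration. A missing BOM does not prove the file is not UTF-8, so a false result here only means that a content check is still needed.

```csharp
using System;
using System.IO;

static class BomCheck
{
    // Hypothetical helper: true if the buffer starts with the UTF-8 BOM EF BB BF.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data != null && data.Length >= 3 &&
               data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }

    // Hypothetical helper: reads at most the first three bytes of a file.
    public static bool FileHasUtf8Bom(string path)
    {
        byte[] head = new byte[3];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = 0;
            while (read < head.Length)
            {
                int n = fs.Read(head, read, head.Length - read);
                if (n == 0)
                {
                    // File is shorter than three bytes, so it cannot hold a BOM.
                    return false;
                }
                read += n;
            }
        }
        return HasUtf8Bom(head);
    }

    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };
        byte[] withoutBom = { 0x41, 0x42, 0x43 };
        Console.WriteLine(HasUtf8Bom(withBom));
        Console.WriteLine(HasUtf8Bom(withoutBom));
    }
}
```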

