The Overlooked Double Quotes


A program that extracts the HTML content from an MHT file contains the following code:

 

                String strEncodng = getEncoding(bp1);
                String strText = getHtmlText(bp1, strEncodng);
 

 

While processing one particular MHT file, it failed with the following error:

 

java.io.UnsupportedEncodingException: "unicode"

 

 

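So I guessed the cause was that strEncodng was unicode: perhaps the encoding declared in the file itself was wrong, and I should try a different one. Hard-coding UTF8 did not work; hard-coding UTF16 did.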

 

                String strEncodng = getEncoding(bp1);
                // strEncodng = "UTF8"
                strEncodng = "UTF16";
                String strText = getHtmlText(bp1, strEncodng);
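Before hard-coding a name like this, it can help to ask the JVM which encoding names it actually recognizes. The following is a minimal sketch, not part of the original program: the class CharsetNameCheck and the method canonicalNameOrNull are hypothetical names introduced here for illustration. Which aliases (for example UTF8, UTF16 or unicode) are accepted depends on the JVM, so the sketch prints the result rather than assuming it.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of the original program.
final class CharsetNameCheck {
    private CharsetNameCheck() {}

    /** Returns the canonical charset name for the given name or alias, or null if this JVM rejects it. */
    static String canonicalNameOrNull(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal names
            // (IllegalCharsetNameException is a subclass of IllegalArgumentException).
            return null;
        }
    }

    public static void main(String[] args) {
        // Alias support varies between JVMs, so just print what this one reports.
        for (String name : new String[] {"UTF8", "UTF16", "unicode", "UTF-8", "UTF-16"}) {
            System.out.println(name + " -> " + canonicalNameOrNull(name));
        }
    }
}
```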
 

 

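Of course the program cannot stay like this, or other MHT files would no longer be handled correctly. So I added a patch: when the encoding is unicode, change it to UTF16.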

 

                String strEncodng = getEncoding(bp1);
                if (strEncodng.equals("unicode")) {
                    strEncodng = "UTF16";
                }
                String strText = getHtmlText(bp1, strEncodng);
 

 

I tested again and got the same exception as before. Strange. So I added some log output to see what was really going on:

 

                String strEncodng = getEncoding(bp1);
                log.debug("strEncodng=" + strEncodng);
                if (strEncodng.equals("unicode")) {
                    strEncodng = "UTF16";
                }
                log.debug("strEncodng=" + strEncodng);
                String strText = getHtmlText(bp1, strEncodng);

 

 

I ran it and found that both log lines printed the same value; the if branch was never entered at all.

 

01:05:01.307 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng="unicode"
01:05:01.307 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng="unicode"
 

 

Could strEncodng contain special characters that are invisible in the output? I added a few more lines of code to find out:

 

                String strEncodng = getEncoding(bp1);
                log.debug("strEncodng=" + strEncodng);
                if (strEncodng.equals("unicode")) {
                    strEncodng = "UTF16";
                } else {
                    log.debug("strEncodng.length=" + strEncodng.length());
                    log.debug("strEncodng.contains=" + strEncodng.contains("unicode"));
                    for (int i = 0; i < strEncodng.length(); ++i) {
                        log.debug("strEncodng[" + i + "]=" + strEncodng.charAt(i) + " " + (int) strEncodng.charAt(i));
                    }
                }
                log.debug("strEncodng=" + strEncodng);
                String strText = getHtmlText(bp1, strEncodng);
 

 

Running it again:

 

01:05:01.307 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng="unicode"
01:05:01.307 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng.length=9
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng.contains=true
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[0]=" 34
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[1]=u 117
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[2]=n 110
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[3]=i 105
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[4]=c 99
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[5]=o 111
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[6]=d 100
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[7]=e 101
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[8]=" 34
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng="unicode"
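The per-character dump above can be folded into a small reusable helper, so that a suspicious string is logged once with visible delimiters, its length, and the numeric value of every character. This is only a sketch: the class CharDump and the method describe are hypothetical names, not part of the original program.

```java
// Hypothetical debugging helper for illustration; not part of the original program.
final class CharDump {
    private CharDump() {}

    /**
     * Returns the value wrapped in square brackets, followed by its length
     * and an index:code pair for every char, all on one line.
     */
    static String describe(String value) {
        if (value == null) {
            return "null";
        }
        StringBuilder sb = new StringBuilder();
        // The brackets make leading or trailing quotes and whitespace stand out.
        sb.append('[').append(value).append("] length=").append(value.length());
        for (int i = 0; i < value.length(); i++) {
            sb.append(' ').append(i).append(':').append((int) value.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\"unicode\""));
    }
}
```

A call such as log.debug("strEncodng=" + CharDump.describe(strEncodng)) would then replace the loop in the else branch.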
 

 

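There it was at last: strEncodng contains not just unicode but also a pair of double quotes wrapped around it. Stripping the double quotes should be enough, so I changed the code once more: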

 

                String strEncodng = getEncoding(bp1);
                strEncodng = strEncodng.replace("\"", "");
                String strText = getHtmlText(bp1, strEncodng);
 

 

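Good, it passes now. It all came down to those overlooked double quotes. In hindsight, the exception message and the log output had already hinted at it: the double quotes around "unicode" are part of the string itself.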
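The symptom is easy to reproduce in isolation. The sketch below assumes that getHtmlText eventually decodes bytes with something like new String(bytes, charsetName); the original post does not show that method, so this is a guess. The class QuotedCharsetDemo and the method decodeFailure are hypothetical names. A double quote is not a legal character in a charset name, so the quoted name is rejected, and on the JDKs I know of the exception message is the name exactly as it was passed in, quotes included.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo for illustration; not part of the original program.
final class QuotedCharsetDemo {
    private QuotedCharsetDemo() {}

    /** Tries to decode with the given charset name; returns the exception message, or null on success. */
    static String decodeFailure(byte[] data, String charsetName) {
        try {
            new String(data, charsetName);
            return null;
        } catch (UnsupportedEncodingException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_16);
        // Quoted name: rejected because '"' is not allowed in a charset name.
        System.out.println(decodeFailure(data, "\"unicode\""));
        // Plain standard name: decodes fine, so null is printed.
        System.out.println(decodeFailure(data, "UTF-16"));
    }
}
```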

 

PS: In real testing, strEncodng can also be null, so adding a null check makes the code more robust.

 

                String strEncodng = getEncoding(bp1);
                if (strEncodng == null) {
                    strEncodng = "GBK";
                } else {
                    strEncodng = strEncodng.replace("\"", "");
                }
                String strText = getHtmlText(bp1, strEncodng);
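The null check and the quote stripping could also be gathered into one helper that additionally verifies the cleaned-up name against the JVM. This is a sketch under stated assumptions, not the original code: the class EncodingNames, the method normalize and the constant DEFAULT_ENCODING are hypothetical names, GBK is kept as the fallback only because the code above uses it, and unlike replace("\"", "") this version removes just one surrounding pair of quotes.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for illustration; not part of the original program.
final class EncodingNames {
    private EncodingNames() {}

    /** Fallback when no usable encoding is declared; GBK mirrors the choice made in the post. */
    static final String DEFAULT_ENCODING = "GBK";

    /** Trims the raw name, strips one pair of surrounding double quotes, and falls back if unusable. */
    static String normalize(String raw) {
        if (raw == null) {
            return DEFAULT_ENCODING;
        }
        String name = raw.trim();
        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
            name = name.substring(1, name.length() - 1).trim();
        }
        if (name.isEmpty()) {
            return DEFAULT_ENCODING;
        }
        try {
            // isSupported returns false for unknown names and throws for illegal ones.
            return Charset.isSupported(name) ? name : DEFAULT_ENCODING;
        } catch (IllegalCharsetNameException e) {
            return DEFAULT_ENCODING;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("\"UTF-16\""));
        System.out.println(normalize(null));
    }
}
```

The call site would then shrink to String strEncodng = EncodingNames.normalize(getEncoding(bp1)); followed by the existing getHtmlText call.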
 

 

 

 

 
