Garbled Chinese when converting txt files to PDF with OpenOffice

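Problem description:

When converting txt files to PDF with OpenOffice, the Chinese text comes out garbled.

Approach and solution:

1. Find the cause of the garbled text

Reading the jodconverter source shows that only UTF-8 encoded text converts without garbling the Chinese characters.

2. How to convert a non-UTF-8 file to UTF-8

Before converting, first determine the encoding of the txt file itself. It turns out that a txt file can start with a header (a byte-order mark) that identifies its encoding.

The detection method is as follows: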

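    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    /**
     * Returns the encoding of a file, guessed from its first two bytes.
     * @param filePath path of the file to inspect
     * @return the charset name; GB2312 when no known header is found
     * @throws IOException if the file cannot be read
     */
    public static String getCharset(String filePath) throws IOException {
        int p;
        // try-with-resources closes the stream even if read() fails
        try (BufferedInputStream bin = new BufferedInputStream(
                new FileInputStream(filePath))) {
            p = (bin.read() << 8) + bin.read();
        }
        String code;

        switch (p) {
        case 0xefbb: // first two bytes of the UTF-8 BOM (EF BB BF)
            code = "UTF-8";
            break;
        case 0xfffe: // FF FE, little-endian UTF-16 BOM
            code = "Unicode";
            break;
        case 0xfeff: // FE FF, big-endian UTF-16 BOM
            code = "UTF-16";
            break;
        default: // no BOM found: assume the local ANSI encoding
            code = "GB2312";
        }
        System.out.println(code);
        return code;
    }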
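The header check above falls back to GB2312 whenever no BOM is found, so a UTF-8 file saved without a BOM would be misreported. One possible way to narrow this down, shown here as a minimal sketch rather than anything jodconverter does, is to attempt a strict UTF-8 decode and treat a decoding error as "not UTF-8". The class and method names (`Utf8Check`, `isValidUtf8`) are hypothetical, and the check is only a heuristic: very short GB2312 inputs can occasionally happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jodconverter or the original post.
final class Utf8Check {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The two characters U+4E2D U+6587 encoded as UTF-8
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // The same two characters as GB2312 bytes (D6 D0 CE C4)
        byte[] gb2312 = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(gb2312));
    }
}
```

A caller could run this check only in the default branch, that is, when no BOM was found, and keep GB2312 as the fallback when the strict decode fails.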

The conversion code is as follows:

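    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.Charset;
    import java.nio.charset.UnsupportedCharsetException;

    /**
     * Writes a text file in the given encoding, overwriting it if it exists.
     *
     * @param file
     *            the file to write
     * @param toCharsetName
     *            the target encoding
     * @param content
     *            the file content
     * @throws Exception
     *             if the charset is unsupported or writing fails
     */
    public static void saveFile2Charset(File file, String toCharsetName,
            String content) throws Exception {
        if (!Charset.isSupported(toCharsetName)) {
            throw new UnsupportedCharsetException(toCharsetName);
        }
        // try-with-resources closes the writer and the stream even on failure
        try (OutputStream outputStream = new FileOutputStream(file);
                OutputStreamWriter outWrite = new OutputStreamWriter(
                        outputStream, toCharsetName)) {
            outWrite.write(content);
        }
    }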

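Testing showed that the converted file was still detected as GBK: OutputStreamWriter does not write a UTF-8 BOM, so the header has to be written to the file manually.

The code is as follows: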

  
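    /**
     * Writes a text file in the given encoding, overwriting it if it exists.
     * When the target encoding is UTF-8, the UTF-8 BOM is written first.
     *
     * @param file
     *            the file to write
     * @param toCharsetName
     *            the target encoding
     * @param content
     *            the file content
     * @throws Exception
     *             if the charset is unsupported or writing fails
     */
    public static void saveFile2Charset(File file, String toCharsetName,
            String content) throws Exception {
        if (!Charset.isSupported(toCharsetName)) {
            throw new UnsupportedCharsetException(toCharsetName);
        }
        try (OutputStream outputStream = new FileOutputStream(file);
                OutputStreamWriter outWrite = new OutputStreamWriter(
                        outputStream, toCharsetName)) {
            // Add the file header: the UTF-8 BOM (EF BB BF). It goes straight
            // to the stream before the writer has flushed anything, and only
            // for UTF-8, because these three bytes would corrupt other encodings.
            if ("UTF-8".equalsIgnoreCase(toCharsetName)) {
                outputStream.write(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
            }
            outWrite.write(content);
        }
    }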

Tested with source files in the following encodings:

GB2312
Unicode
UTF-16
UTF-8

All of them converted successfully.
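The snippets above detect the encoding and write the output, but the step in between, reading the content with the detected charset, is not shown. The following is a minimal sketch of how the pieces might fit together, assuming the file is small enough to load into memory and that the source charset name comes from a detector such as getCharset. The class and method names (`TxtToUtf8`, `toUtf8WithBom`, `convertFile`) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper tying detection and conversion together.
final class TxtToUtf8 {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Decodes raw bytes with the source charset and re-encodes them as UTF-8 with a BOM. */
    static byte[] toUtf8WithBom(byte[] raw, Charset source) {
        String text = new String(raw, source);
        // Some decoders (UTF-8, UTF-16LE/BE) keep a leading BOM as U+FEFF; drop it
        // so the output does not end up with two BOMs.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    /** Rewrites the file in place as UTF-8 with a BOM. */
    static void convertFile(Path file, String sourceCharset) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        Files.write(file, toUtf8WithBom(raw, Charset.forName(sourceCharset)));
    }

    public static void main(String[] args) throws IOException {
        // Create a small UTF-16LE sample file, convert it, and inspect the header bytes.
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "\u4e2d\u6587".getBytes(StandardCharsets.UTF_16LE));
        convertFile(tmp, "UTF-16LE");
        byte[] result = Files.readAllBytes(tmp);
        System.out.printf("%02X %02X %02X%n", result[0], result[1], result[2]);
        Files.delete(tmp);
    }
}
```

Keeping the byte-level conversion separate from the file I/O makes it easy to check the output without touching the disk; the converted file can then be handed to the OpenOffice conversion as before.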

Notes on txt encodings and file headers

Mapping between Java charset names and txt encodings
Java	txt
unicode	Unicode big endian
utf-8	UTF-8
utf-16	Unicode
gb2312	ANSI

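What is a BOM

A BOM (byte-order mark) is a special marker inserted at the start of a Unicode file encoded as UTF-8, UTF-16, or UTF-32, used to identify the file's encoding. For UTF-8 the BOM is not required: UTF-8 has no byte-order ambiguity, and the BOM mainly serves to mark the encoding and byte order (big-endian or little-endian) of multi-byte encoded files.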

BOM file headers:

00 00 FE FF = UTF-32, big-endian

FF FE 00 00 = UTF-32, little-endian

EF BB BF = UTF-8

FE FF = UTF-16, big-endian

FF FE = UTF-16, little-endian
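The table above can be turned into a lookup that covers all five headers. This is a minimal sketch with hypothetical names (`BomSniffer`, `detect`); the returned strings are Java charset names chosen for illustration. The UTF-32 patterns are tested before the UTF-16 ones because FF FE 00 00 also begins with the UTF-16 little-endian header FF FE.

```java
// Hypothetical helper that maps the leading bytes of a file to an encoding name.
final class BomSniffer {

    /** Returns the encoding named by a leading BOM, or null when no known BOM is present. */
    static String detect(byte[] head) {
        // Longer patterns first: FF FE 00 00 would otherwise match as UTF-16LE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61}));
        System.out.println(detect(new byte[]{(byte) 0xFF, (byte) 0xFE, 0x61, 0x00}));
        System.out.println(detect(new byte[]{0x61, 0x62}));
    }
}
```

Passing the first four bytes of a file is enough. A null result means the file has no BOM, in which case the encoding has to be guessed by other means, such as the GB2312 default used earlier.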

Note: jodconverter 2.2.1 does not support converting docx, xlsx, or ppt files to PDF.
