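1. ANSI encoding: ANSI stands for the American National Standards Institute. When Notepad or other Windows software offers "ANSI" as an encoding, it means the code page of the current Windows locale, so "ANSI" differs from region to region. On a Simplified Chinese system, "ANSI" is GBK.
2. ASCII encoding: American Standard Code for Information Interchange. It can represent only symbols, letters, digits and control characters; it cannot represent Chinese characters. For details, see an ASCII code table.
3. A Chinese character usually takes three bytes in UTF-8 and two bytes in GBK.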
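Note: when saving as UTF-8, the Notepad bundled with Windows prepends three bytes to the file (the byte order mark, EF BB BF). For that reason Notepad is not recommended; use an editor such as Notepad++ instead.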
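The byte counts and the three-byte prefix mentioned above can be checked directly in Java. The following is a minimal sketch, not part of the original walkthrough: the class name EncodingLengthDemo and its two helper methods are made up for illustration, and it assumes the running JDK ships the GBK charset (a standard full JDK does).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingLengthDemo {

    // Number of bytes needed to encode the text in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        String ni = "\u4f60"; // the character 你, written as an escape
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available

        System.out.println(encodedLength(ni, StandardCharsets.UTF_8));     // 3
        System.out.println(encodedLength(ni, gbk));                        // 2
        System.out.println(encodedLength("A", StandardCharsets.US_ASCII)); // 1

        // Simulated file content: BOM followed by the UTF-8 bytes of 你.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                          (byte) 0xE4, (byte) 0xBD, (byte) 0xA0};
        System.out.println(startsWithUtf8Bom(withBom));                    // true
    }
}
```

A check like startsWithUtf8Bom is one way to skip the prefix when a file saved by Notepad has to be read anyway.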
/**
 * Reads a byte of data from this input stream. This method blocks
 * if no input is yet available.
 *
 * @return     the next byte of data, or -1 if the end of the
 *             file is reached.
 * @exception  IOException  if an I/O error occurs.
 */
public int read() throws IOException {
    return read0();
}
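The comment says: read one byte of data from this input stream and return that byte; when the end of the file is reached, return -1.
The wording alone does not make the behavior obvious, so the code below demonstrates it.
1. Create a local file test.txt containing the single character “你” and save it with ANSI encoding. As explained above, on a Chinese-locale system this encoding is GBK.
2. Write the following code: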
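import java.io.FileInputStream;

@org.junit.Test
public void test() throws Exception {
    // try-with-resources closes the stream even if read() throws
    try (FileInputStream is = new FileInputStream("f:/test.txt")) {
        int b;
        // read() returns one byte as an int, or -1 at the end of the file
        while ((b = is.read()) != -1) {
            System.out.println(b);
        }
    }
}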
The output is:
196
227
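Looking up the GBK code table shows that “你” is encoded as 0xC4 0xE3, which is 11000100 11100011 in binary. The first 8 bits are 196 in decimal and the last 8 bits are 227, exactly matching the output above.
So, going back to the comment on read(): the return value is the 8-bit byte that was read, returned as an int between 0 and 255.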
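Because Java's byte type is signed, the int returned by read() and the same value stored in a byte look different when printed. The sketch below illustrates this with an in-memory stream instead of the test.txt file, so it runs anywhere; the class name UnsignedByteDemo and the method firstRead are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;

// Hypothetical helper class, for illustration only.
class UnsignedByteDemo {

    // Returns what read() yields for the first byte of the data,
    // or -1 if the data is empty.
    static int firstRead(byte[] data) {
        return new ByteArrayInputStream(data).read();
    }

    public static void main(String[] args) {
        // The GBK bytes of 你 from the walkthrough above.
        byte[] gbkNi = {(byte) 0xC4, (byte) 0xE3};

        int value = firstRead(gbkNi);
        System.out.println(value);                         // 196
        System.out.println(Integer.toBinaryString(value)); // 11000100

        // Narrowing to byte keeps the same 8 bits but prints as a signed value.
        byte stored = (byte) value;
        System.out.println(stored);                        // -60

        // Masking with 0xFF recovers the unsigned value.
        System.out.println(stored & 0xFF);                 // 196

        // An empty stream is at end of input immediately.
        System.out.println(firstRead(new byte[0]));        // -1
    }
}
```

Returning an int rather than a byte is what lets -1 serve as an end-of-file marker without colliding with any real byte value.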
Next, the read(byte b[]) method. First, look at its definition:
/**
 * Reads up to b.length bytes of data from this input
 * stream into an array of bytes. This method blocks until some input
 * is available.
 *
 * @param      b   the buffer into which the data is read.
 * @return     the total number of bytes read into the buffer, or
 *             -1 if there is no more data because the end of
 *             the file has been reached.
 * @exception  IOException  if an I/O error occurs.
 */
public int read(byte b[]) throws IOException {
    return readBytes(b, 0, b.length);
}
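The method reads up to b.length bytes from this input stream and stores them in the byte array b. The return value is the number of bytes actually read into the buffer by this call, or -1 when the end of the file has been reached.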
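The return value is easiest to see by calling read(byte[]) repeatedly with a buffer smaller than the data. The following sketch uses a ByteArrayInputStream in place of a file so that it is self-contained; ReadCountDemo and readCounts are hypothetical names introduced for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class ReadCountDemo {

    // Calls read(buffer) until it returns -1 and records every return value.
    // bufferSize must be positive, otherwise read() keeps returning 0.
    static List<Integer> readCounts(byte[] data, int bufferSize) {
        List<Integer> counts = new ArrayList<>();
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = new ByteArrayInputStream(data)) {
            int n;
            do {
                n = in.read(buffer);
                counts.add(n);
            } while (n != -1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] fiveBytes = {1, 2, 3, 4, 5};
        // A full buffer first, then the one remaining byte, then end of stream.
        System.out.println(readCounts(fiveBytes, 4)); // [4, 1, -1]
    }
}
```

On the second call only buffer[0] is overwritten; the other three slots still hold bytes from the first call, which is why code that consumes the buffer should use only the first n bytes.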
Now look at the method it calls:
/**
 * Reads a subarray as a sequence of bytes.
 * @param b the data to be written
 * @param off the start offset in the data
 * @param len the number of bytes that are written
 * @exception IOException If an I/O error has occurred.
 */
private native int readBytes(byte b[], int off, int len) throws IOException;
This method is native, so its implementation cannot be viewed in the Java source. We can, however, use the read(byte b[]) method of InputStream as a reference:
public int read(byte b[]) throws IOException {
    return read(b, 0, b.length);
}

public int read(byte b[], int off, int len) throws IOException {
    if (b == null) {
        throw new NullPointerException();
    } else if (off < 0 || len < 0 || len > b.length - off) {
        throw new IndexOutOfBoundsException();
    } else if (len == 0) {
        return 0;
    }

    int c = read();
    if (c == -1) {
        return -1;
    }
    b[off] = (byte)c;

    int i = 1;
    try {
        for (; i < len ; i++) {
            c = read();
            if (c == -1) {
                break;
            }
            b[off + i] = (byte)c;
        }
    } catch (IOException ee) {
    }
    return i;
}
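Look at the second method. Combined with what we now know about read(), we can see that it contains a for loop that reads one byte per iteration and stores each byte in the array b[]. The return value is the number of bytes actually read, which is less than or equal to len.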
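One way to observe this loop is to subclass InputStream, implement only read(), and count how often the inherited read(byte[]) calls it. This is a sketch under that assumption; CountingStreamDemo, CountingStream and readOnce are hypothetical names, and the behavior shown is that of the default InputStream implementation, not of FileInputStream's native readBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper class, for illustration only.
class CountingStreamDemo {

    // Implements only the single-byte read(); the bulk read methods are inherited.
    static class CountingStream extends InputStream {
        private final byte[] data;
        private int pos;
        int readCalls; // how many times read() has been invoked

        CountingStream(byte[] data) {
            this.data = data;
        }

        @Override
        public int read() {
            readCalls++;
            return pos < data.length ? (data[pos++] & 0xFF) : -1;
        }
    }

    // Performs one read(buffer) call and returns {bytesRead, readCalls}.
    static int[] readOnce(byte[] data, int bufferSize) {
        CountingStream in = new CountingStream(data);
        try {
            int n = in.read(new byte[bufferSize]);
            return new int[] {n, in.readCalls};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = readOnce(new byte[] {10, 20, 30}, 10);
        System.out.println(result[0]); // 3: bytes actually read, less than len
        System.out.println(result[1]); // 4: three data bytes plus the call that returned -1
    }
}
```

The extra fourth call is the one that hits the end of the data and makes the loop break.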
The following method is recommended for reading the contents of a text file:
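/**
 * Converts an input stream to a byte array.
 *
 * @param in the input stream to read
 * @return all bytes read from the stream
 * @throws IOException if an I/O error occurs
 */
private byte[] toByteArray(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024 * 4];
    int n;
    while ((n = in.read(buffer)) != -1) {
        // copy only the n bytes filled by this read, not the whole buffer
        out.write(buffer, 0, n);
    }
    return out.toByteArray();
}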
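The key piece is ByteArrayOutputStream: each pass reads up to 4 KB, accumulates it in the ByteArrayOutputStream object, and the accumulated data is finally converted to a byte array.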
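Once all bytes are collected, they still have to be decoded with the charset the file was saved in before they become text. The sketch below shows one way to do that; DecodeDemo and readText are hypothetical names, the in-memory stream stands in for the test.txt file, and GBK is assumed to be available in the running JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class DecodeDemo {

    // Same approach as the method above: read in 4 KB chunks and accumulate.
    static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 4];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Reads the whole stream and decodes it with the given charset.
    static String readText(InputStream in, Charset charset) {
        try {
            return new String(toByteArray(in), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The two GBK bytes 196 and 227 printed in the earlier experiment.
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3};
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available

        String text = readText(new ByteArrayInputStream(gbkBytes), gbk);
        System.out.println(text.equals("\u4f60")); // true: the bytes decode to 你
    }
}
```

Decoding only after the whole array is assembled avoids splitting a multi-byte character across two buffer reads. On Java 9 and later, InputStream.readAllBytes() can replace the hand-written loop.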