最近有遇到需要用go读取大文件的情况,顺路研究了一下go几种读取文件方式的效率。
使用os包内的open和read。
fi, err := os.Open(path) // 打开文件
buf := make([]byte, 1024)
n, err := fi.Read(buf) // 读取内容
使用buffered io
fi, err := os.Open(path)
r := bufio.NewReader(fi)
buf := make([]byte, 1024)
n, err := r.Read(buf)
使用ioutil包内的方法
fi, err := os.Open(path)
fd, err := ioutil.ReadAll(fi)
准备了待读取文件信息如下:
total 720912
-rw-r--r-- 1 stephen staff 2.3K Sep 15 11:59 io_demo.go
-rw-r--r-- 1 stephen staff 336M Sep 15 11:59 test.txt
同时io_demo.go文件中的代码如下:
package main
import (
"bufio"
"fmt"
"io"
"io/ioutil"
"os"
"time"
)
func readRaw(path string) string {
start := time.Now()
fi, err := os.Open(path)
if err != nil {
panic(err)
}
defer fi.Close()
defer func() {
fi.Close()
fmt.Printf("[readRaw] cost time %v \n", time.Now().Sub(start))
}()
var data []byte
buf := make([]byte, 1024)
for {
n, err := fi.Read(buf)
if err != nil && err != io.EOF {
panic(err)
}
data = append(data, buf[:n]...)
if 0 == n {
break
}
}
return string(data)
}
func readWithBufferIO(path string) string {
start := time.Now()
fi, err := os.Open(path)
if err != nil {
panic(err)
}
defer func() {
fi.Close()
fmt.Printf("[readWithBufferIO] cost time %v \n", time.Now().Sub(start))
}()
r := bufio.NewReader(fi)
var data []byte
buf := make([]byte, 1024)
for {
n, err := r.Read(buf)
if err != nil && err != io.EOF {
panic(err)
}
if 0 == n {
break
}
data = append(data, buf[:n]...)
}
return string(data)
}
func readWithIOUtil(path string) string {
start := time.Now()
fi, err := os.Open(path)
if err != nil {
panic(err)
}
defer func() {
fi.Close()
fmt.Printf("[readWithIOUtil] cost time %v \n", time.Now().Sub(start))
}()
fd, err := ioutil.ReadAll(fi)
return string(fd)
}
func main() {
file := "test.txt"
readRaw(file)
readWithBufferIO(file)
readWithIOUtil(file)
}
用如上代码读取已准备的文件,多次测试用时信息如下(进行了超过10次测试,仅取了两个结果来说明问题):
[readRaw] cost time 1.490717874s
[readWithBufferIO] cost time 573.336617ms
[readWithIOUtil] cost time 379.678285ms
[readRaw] cost time 1.45133396s
[readWithBufferIO] cost time 541.944555ms
[readWithIOUtil] cost time 983.909509ms
可以看到,毫无疑问使用os包readRaw读取的方式是最慢的,且相比其他两种方式要慢很多。但是readWithBufferIO和readWithIOUtil 两种方式速度的快慢就很难分伯仲了。
既然得到了这个结论,那么我们来看看为什么会这样。
看bufio源码
// NewReader returns a new Reader whose buffer has the default size.
func NewReader(rd io.Reader) *Reader {
return NewReaderSize(rd, defaultBufSize)
}
再看NewReaderSize方法
// NewReaderSize returns a new Reader whose buffer has at least the specified
// size. If the argument io.Reader is already a Reader with large enough
// size, it returns the underlying Reader.
func NewReaderSize(rd io.Reader, size int) *Reader {
// Is it already a Reader?
b, ok := rd.(*Reader)
if ok && len(b.buf) >= size {
return b
}
if size < minReadBufferSize {
size = minReadBufferSize
}
r := new(Reader)
r.reset(make([]byte, size), rd)
return r
}
bufferio默认创建一个大小为4096 byte的缓冲区,它的 read 方法执行一次IO系统调用读取4096byte(4K)大小到缓冲区,此后r.Read(buf)
都会从缓冲区中读。而普通io每次读/写操作都会执行系统调用,必然会比bufferIO慢很多,毕竟每次系统调用都会从执行从用户态到内核态的切换。
来看ioutil
源码
// MinRead is the minimum slice size passed to a Read call by
// Buffer.ReadFrom. As long as the Buffer has at least MinRead bytes beyond
// what is required to hold the contents of r, ReadFrom will not grow the
// underlying buffer.
const MinRead = 512
// ReadAll reads from r until an error or EOF and returns the data it read.
// A successful call returns err == nil, not err == EOF. Because ReadAll is
// defined to read from src until EOF, it does not treat an EOF from Read
// as an error to be reported.
func ReadAll(r io.Reader) ([]byte, error) {
return readAll(r, bytes.MinRead)
}
// readAll reads from r until an error or EOF and returns the data it read
// from the internal buffer allocated with a specified capacity.
func readAll(r io.Reader, capacity int64) (b []byte, err error) {
var buf bytes.Buffer
// If the buffer overflows, we will get bytes.ErrTooLarge.
// Return that as an error. Any other panic remains.
defer func() {
e := recover()
if e == nil {
return
}
if panicErr, ok := e.(error); ok && panicErr == bytes.ErrTooLarge {
err = panicErr
} else {
panic(e)
}
}()
if int64(int(capacity)) == capacity {
buf.Grow(int(capacity))
}
_, err = buf.ReadFrom(r)
return buf.Bytes(), err
}
可以看到,ioutil.ReadAll
最后实现的也是一个带缓冲的IO,且大小在512byte以上,且使用的是bytes.Buffer,可以根据情况动态的增长。但是的Grow时重新分配buf也会带来一些开销,所以两者相比就变成了一个权衡,没有绝对占优。
但是ioutil的好处就是方便,ioutil.ReadAll
或者ioutil.ReadFile
一行代码就搞定。