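Related articles (you may want to read the earlier parts of this series first)
The most complete Spring Boot series in China, part 3
No warm-up, no high concurrency: "Rate limiting algorithms, the third weapon: the token bucket algorithm" - Part 302
When the water is full it overflows: "Rate limiting algorithms, the fourth weapon: the leaky bucket algorithm" - Part 303
One-minute guide: cache penetration, cache breakdown, cache avalanche - Part 304
The Bloom Filter actually helped me solve a problem at a big company - Part 305
How to read a 100 GB file - Part 306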
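Master: Disciple, are you awake yet? Get up and study.
Wuxian: Master, it isn't even light out yet, is it?
Master: Study early. Haven't you heard that the early bird gets the worm?
Wuxian: Late birds get worms too. Late birds eat the late worms.
Master: Yes, yes, you're always right. But if you don't get up now, lunch will be gone.
Wuxian: Uh-oh, Master, it's not almost afternoon already, is it?
Master: It is. You only just noticed? The sun is already shining on your backside.
Wuxian: (#^.^#) ....
Master: Hurry up and eat, then come and study...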
Contents
1. Reading large files by splitting the file
2. Reading large files with multiple threads
3. Wuxian's summary
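1. Reading large files by splitting the file
The core idea is simple: if the file is too large, split it into several smaller files and read those with multiple threads. The steps are:
(1) Split the file into several smaller files.
(2) Give each thread its own file, so that no two threads work on the same file.
(3) Read each file line by line.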
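1.1 Splitting the file
Both Mac and Linux provide a command for splitting files:
split -b 1024m test2.txt /data/tmp/my/test.txt.
Notes:
(1) split: the split command;
(2) -b 1024m: the size in bytes of each output file. The units m and k are supported; here a 6.5 GB file is cut into 1 GB pieces, which gives about 7 files. Because -b cuts at exact byte offsets, a line can end up split across two files.
(3) test2.txt: the file to split;
(4) test.txt. : the prefix for the output file names; split automatically appends a suffix (aa, ab, ac, ...) to this prefix;
Other options:
(1) -l<lines>: the number of lines in each output file.
(2) -C<bytes>: similar to -b, but tries to keep each line intact when cutting.
After a successful split, /data/tmp/my contains the pieces test.txt.aa, test.txt.ab, and so on.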
1.2 Reading the split files with multiple threads
We start one thread per split file and let each thread process its own file:
public void readFileBySplitFile(String pathname) {
    // pathname is a directory, not a single file, for example: /data/tmp/my
    File dir = new File(pathname);
    File[] files = dir.listFiles();
    if (files == null) {
        // listFiles() returns null when pathname is not a readable directory.
        return;
    }
    List<MyThread> threads = new ArrayList<>();
    for (File f : files) {
        MyThread thread = new MyThread(f.getPath());
        threads.add(thread);
        thread.start();
    }
    // Wait for every worker thread to finish.
    for (MyThread t : threads) {
        try {
            t.join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

private class MyThread extends Thread {
    private final String pathname;

    public MyThread(String pathname) {
        this.pathname = pathname;
    }

    @Override
    public void run() {
        // readFileFileChannel reads a single file; it is not shown in this article.
        readFileFileChannel(pathname);
    }
}
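Notes:
(1) List all the split files in the given directory;
(2) Iterate over the file paths and hand each path to its own thread; each thread's run method calls readFileFileChannel to read that file.
(3) join: makes the calling (main) thread wait until the joined thread has finished, so the method only returns after every worker is done. If this is unfamiliar, see the earlier article "Wuxian and the Master visit the Kingdom of Women: turning parallel threads into serial ones, Thread you are amazing".
Test: 6.5 GB took 4 seconds.
In theory, the larger the file, the more this multithreaded approach pays off. The number of threads here equals the number of files. Can you do that in practice? Definitely not, because a very large file could produce far more pieces than the machine has cores. You probably already know how to improve on this, so we won't go into it here.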
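One possible improvement, as a minimal sketch rather than the author's own solution: submit one task per split file to a fixed-size thread pool, so the number of running threads stays bounded no matter how many pieces there are. The class PooledSplitReader and the method countLines are hypothetical names introduced for this example; countLines only stands in for the real per-file work (readFileFileChannel in this article).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the article's original code.
final class PooledSplitReader {

    // Processes every regular file in dir with at most poolSize threads
    // and returns the total number of lines seen.
    static long readAll(Path dir, int poolSize)
            throws IOException, InterruptedException, ExecutionException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(dir)) {
            files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path file : files) {
                Callable<Long> task = () -> countLines(file);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that file has been processed
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for the real per-file processing; assumes UTF-8 text.
    static long countLines(Path file) {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A pool size close to the number of CPU cores is a reasonable starting point, for example PooledSplitReader.readAll(Paths.get("/data/tmp/my"), 6) on a 6-core machine.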
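2. Reading large files: multiple threads on the same file
2.1 Multithreading, version 1.0
Now let's look at reading a single file with multiple threads. The idea is to divide the file into segments and read from different positions at the same time. RandomAccessFile meets this requirement, because its seek method lets you set the position at which reading starts.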
public void readFileByMutiThread(String pathname, int threadCount) {
    // Start and end offset of each thread's segment.
    long[] beginIndexs = new long[threadCount];
    long[] endIndexs = new long[threadCount];
    // try-with-resources closes the file once the boundaries are known.
    try (BufferedRandomAccessFile randomAccessFile = new BufferedRandomAccessFile(pathname, "r")) {
        // Total file length, used to size the segments.
        long fileTotalLength = randomAccessFile.length();
        // Nominal size of each segment.
        long gap = fileTotalLength / threadCount;
        // Start position of the next segment.
        long nextStartIndex = 0;
        // Find the start and end position of every segment.
        for (int n = 0; n < threadCount; n++) {
            beginIndexs[n] = nextStartIndex;
            // The last thread takes whatever is left.
            if (n + 1 == threadCount) {
                endIndexs[n] = fileTotalLength;
                break;
            }
            // (1) Advance by the nominal segment size, without running past the end of the file.
            nextStartIndex = Math.min(nextStartIndex + gap, fileTotalLength);
            // (2) That position is usually in the middle of a line,
            // so seek to it and scan forward to the end of the line.
            randomAccessFile.seek(nextStartIndex);
            boolean eol = false;
            while (!eol) {
                int c = randomAccessFile.read();
                if (c == -1) {
                    // End of file: nothing more to consume.
                    break;
                }
                nextStartIndex++;
                if (c == '\n') {
                    eol = true;
                } else if (c == '\r') {
                    eol = true;
                    // Treat "\r\n" as a single line terminator.
                    if (randomAccessFile.read() == '\n') {
                        nextStartIndex++;
                    }
                }
            }
            // nextStartIndex now points just past the line terminator.
            endIndexs[n] = nextStartIndex;
        }
    } catch (IOException e) {
        // FileNotFoundException is a subclass of IOException.
        e.printStackTrace();
        return;
    }
    // Start the threads.
    List<MyThread2> threads = new ArrayList<>();
    for (int i = 0; i < threadCount; i++) {
        MyThread2 thread = new MyThread2(pathname, beginIndexs[i], endIndexs[i]);
        threads.add(thread);
        thread.start();
    }
    // Wait for all threads to finish.
    for (MyThread2 t : threads) {
        try {
            t.join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
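Note: this method divides the file into as many segments as there are threads; each thread is responsible for processing one segment.
Now let's look at what each thread does: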
private class MyThread2 extends Thread {
    private final long begin;
    private final long end;
    private final String pathname;

    public MyThread2(String pathname, long begin, long end) {
        this.pathname = pathname;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public void run() {
        // try-with-resources closes the file when the segment is done.
        try (RandomAccessFile randomAccessFile = new RandomAccessFile(pathname, "r")) {
            // Jump to the start of this thread's segment.
            randomAccessFile.seek(begin);
            StringBuilder buffer = new StringBuilder();
            String str;
            // Stop once the file pointer reaches the end of the segment.
            // Using the file pointer works for both "\n" and "\r\n" line endings.
            while (randomAccessFile.getFilePointer() < end
                    && (str = randomAccessFile.readLine()) != null) {
                // Process the line; the full line is not kept in memory.
                // Simulated processing: keep the first character of each non-empty line.
                if (!str.isEmpty()) {
                    buffer.append(str.charAt(0));
                }
            }
            System.out.println("buffer.length:" + buffer.length() + "--" + Thread.currentThread().getName());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
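Note: the job of this thread is to read the data that lies between the positions begin and end.
Running it on the 6.5 GB file takes a very long time. There was no telling when it would finish, so in the end I gave up waiting and stopped it.
Why is it so slow? Wasn't this approach supposed to be great? Why hurt my fragile little heart?
Let's analyze. The earlier method readFileByRandomAccessFile was also very slow when we tested it, so the threads are not what makes this slow. Then what is?
Let's look at the readLine() method of RandomAccessFile: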
public final String readLine() throws IOException {
    StringBuffer input = new StringBuffer();
    int c = -1;
    boolean eol = false;
    while (!eol) {
        switch (c = read()) {
        case -1:
        case '\n':
            eol = true;
            break;
        case '\r':
            eol = true;
            long cur = getFilePointer();
            if ((read()) != '\n') {
                seek(cur);
            }
            break;
        default:
            input.append((char)c);
            break;
        }
    }
    if ((c == -1) && (input.length() == 0)) {
        return null;
    }
    return input.toString();
}
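This method works as follows: a while loop keeps reading one byte at a time, and when it meets \n or \r, readLine ends and returns the data of that line. So the core of it is read():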
public int read() throws IOException {
return read0();
}
private native int read0() throws IOException;
It calls straight into a native method. What does that method do? Its Javadoc comment tells us:
* Reads a byte of data from this file. The byte is returned as an
* integer in the range 0 to 255 ({@code 0x00-0x0ff}). This
* method blocks if no input is yet available.
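In other words, each call to read() reads a single byte from the file and returns it as an integer between 0 and 255; if no input is available yet, the call blocks.
By now you can probably see why this is so slow: the file is read one byte per call, and every call goes through the native method. For a 6.5 GB file that adds up to several billion calls.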
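To make the difference concrete, here is a small sketch (not from the original article) that counts newline characters in two ways: one read() call per byte, as readLine() does, and one read(byte[]) call per 64 KB block. The class name ByteReadDemo and both method names are hypothetical. Both methods return the same count; on a large file the block version should need far fewer calls into the operating system.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical demo class, introduced only for this comparison.
final class ByteReadDemo {

    // One read() call per byte; on a plain RandomAccessFile each call is a native call.
    static long countNewlinesByteByByte(String pathname) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pathname, "r")) {
            long count = 0;
            int c;
            while ((c = raf.read()) != -1) {
                if (c == '\n') {
                    count++;
                }
            }
            return count;
        }
    }

    // One read(byte[]) call per block of up to 64 KB; the bytes are then scanned in memory.
    static long countNewlinesInBlocks(String pathname) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pathname, "r")) {
            byte[] block = new byte[64 * 1024];
            long count = 0;
            int n;
            while ((n = raf.read(block)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (block[i] == '\n') {
                        count++;
                    }
                }
            }
            return count;
        }
    }
}
```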
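2.2 Multithreading, version 2.0
So what can we do? There is a class called BufferedRandomAccessFile. It is not part of the JDK, so you need to find the source code yourself (the copy below is distributed under the Apache License 2.0):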
package com.kfit.bloomfilter;
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
/**
* A BufferedRandomAccessFile is like a
* RandomAccessFile, but it uses a private buffer so that most
* operations do not require a disk access.
*
* Note: The operations on this class are unmonitored. Also, the correct
* functioning of the RandomAccessFile methods that are not
* overridden here relies on the implementation of those methods in the
* superclass.
*/
public final class BufferedRandomAccessFile extends RandomAccessFile
{
static final int LogBuffSz_ = 16; // 64K buffer
public static final int BuffSz_ = (1 << LogBuffSz_);
static final long BuffMask_ = ~(((long) BuffSz_) - 1L);
private String path_;
/*
* This implementation is based on the buffer implementation in Modula-3's
* "Rd", "Wr", "RdClass", and "WrClass" interfaces.
*/
private boolean dirty_; // true iff unflushed bytes exist
private boolean syncNeeded_; // dirty_ can be cleared by e.g. seek, so track sync separately
private long curr_; // current position in file
private long lo_, hi_; // bounds on characters in "buff"
private byte[] buff_; // local buffer
private long maxHi_; // this.lo + this.buff.length
private boolean hitEOF_; // buffer contains last file block?
private long diskPos_; // disk position
/*
* To describe the above fields, we introduce the following abstractions for
* the file "f":
*
* len(f) the length of the file curr(f) the current position in the file
* c(f) the abstract contents of the file disk(f) the contents of f's
* backing disk file closed(f) true iff the file is closed
*
* "curr(f)" is an index in the closed interval [0, len(f)]. "c(f)" is a
* character sequence of length "len(f)". "c(f)" and "disk(f)" may differ if
* "c(f)" contains unflushed writes not reflected in "disk(f)". The flush
* operation has the effect of making "disk(f)" identical to "c(f)".
*
* A file is said to be *valid* if the following conditions hold:
*
* V1. The "closed" and "curr" fields are correct:
*
* f.closed == closed(f) f.curr == curr(f)
*
* V2. The current position is either contained in the buffer, or just past
* the buffer:
*
* f.lo <= f.curr <= f.hi
*
* V3. Any (possibly) unflushed characters are stored in "f.buff":
*
* (forall i in [f.lo, f.curr): c(f)[i] == f.buff[i - f.lo])
*
* V4. For all characters not covered by V3, c(f) and disk(f) agree:
*
* (forall i in [f.lo, len(f)): i not in [f.lo, f.curr) => c(f)[i] ==
* disk(f)[i])
*
* V5. "f.dirty" is true iff the buffer contains bytes that should be
* flushed to the file; by V3 and V4, only part of the buffer can be dirty.
*
* f.dirty == (exists i in [f.lo, f.curr): c(f)[i] != f.buff[i - f.lo])
*
* V6. this.maxHi == this.lo + this.buff.length
*
* Note that "f.buff" can be "null" in a valid file, since the range of
* characters in V3 is empty when "f.lo == f.curr".
*
* A file is said to be *ready* if the buffer contains the current position,
* i.e., when:
*
* R1. !f.closed && f.buff != null && f.lo <= f.curr && f.curr < f.hi
*
* When a file is ready, reading or writing a single byte can be performed
* by reading or writing the in-memory buffer without performing a disk
* operation.
*/
/**
* Open a new BufferedRandomAccessFile on file
* in mode mode, which should be "r" for reading only, or
* "rw" for reading and writing.
*/
public BufferedRandomAccessFile(File file, String mode) throws IOException
{
this(file, mode, 0);
}
public BufferedRandomAccessFile(File file, String mode, int size) throws IOException
{
super(file, mode);
path_ = file.getAbsolutePath();
this.init(size);
}
/**
* Open a new BufferedRandomAccessFile on the file named
* name in mode mode, which should be "r" for
* reading only, or "rw" for reading and writing.
*/
public BufferedRandomAccessFile(String name, String mode) throws IOException
{
this(name, mode, 0);
}
public BufferedRandomAccessFile(String name, String mode, int size) throws FileNotFoundException
{
super(name, mode);
path_ = name;
this.init(size);
}
private void init(int size)
{
this.dirty_ = false;
this.lo_ = this.curr_ = this.hi_ = 0;
this.buff_ = (size > BuffSz_) ? new byte[size] : new byte[BuffSz_];
this.maxHi_ = (long) BuffSz_;
this.hitEOF_ = false;
this.diskPos_ = 0L;
}
public String getPath()
{
return path_;
}
public void sync() throws IOException
{
if (syncNeeded_)
{
flush();
getChannel().force(true);
syncNeeded_ = false;
}
}
// public boolean isEOF() throws IOException
// {
// assert getFilePointer() <= length();
// return getFilePointer() == length();
// }
public void close() throws IOException
{
this.flush();
this.buff_ = null;
super.close();
}
/**
* Flush any bytes in the file's buffer that have not yet been written to
* disk. If the file was created read-only, this method is a no-op.
*/
public void flush() throws IOException
{
this.flushBuffer();
}
/* Flush any dirty bytes in the buffer to disk. */
private void flushBuffer() throws IOException
{
if (this.dirty_)
{
if (this.diskPos_ != this.lo_)
super.seek(this.lo_);
int len = (int) (this.curr_ - this.lo_);
super.write(this.buff_, 0, len);
this.diskPos_ = this.curr_;
this.dirty_ = false;
}
}
/*
* Read at most "this.buff.length" bytes into "this.buff", returning the
* number of bytes read. If the return result is less than
* "this.buff.length", then EOF was read.
*/
private int fillBuffer() throws IOException
{
int cnt = 0;
int rem = this.buff_.length;
while (rem > 0)
{
int n = super.read(this.buff_, cnt, rem);
if (n < 0)
break;
cnt += n;
rem -= n;
}
if ( (cnt < 0) && (this.hitEOF_ = (cnt < this.buff_.length)) )
{
// make sure buffer that wasn't read is initialized with -1
Arrays.fill(this.buff_, cnt, this.buff_.length, (byte) 0xff);
}
this.diskPos_ += cnt;
return cnt;
}
/*
* This method positions this.curr at position pos.
* If pos does not fall in the current buffer, it flushes the
* current buffer and loads the correct one.
*
* On exit from this routine this.curr == this.hi iff pos
* is at or past the end-of-file, which can only happen if the file was
* opened in read-only mode.
*/
public void seek(long pos) throws IOException
{
if (pos >= this.hi_ || pos < this.lo_)
{
// seeking outside of current buffer -- flush and read
this.flushBuffer();
this.lo_ = pos & BuffMask_; // start at BuffSz boundary
this.maxHi_ = this.lo_ + (long) this.buff_.length;
if (this.diskPos_ != this.lo_)
{
super.seek(this.lo_);
this.diskPos_ = this.lo_;
}
int n = this.fillBuffer();
this.hi_ = this.lo_ + (long) n;
}
else
{
// seeking inside current buffer -- no read required
if (pos < this.curr_)
{
// if seeking backwards, we must flush to maintain V4
this.flushBuffer();
}
}
this.curr_ = pos;
}
public long getFilePointer()
{
return this.curr_;
}
public long length() throws IOException
{
// max accounts for the case where we have written past the old file length, but not yet flushed our buffer
return Math.max(this.curr_, super.length());
}
public int read() throws IOException
{
if (this.curr_ >= this.hi_)
{
// test for EOF
// if (this.hi < this.maxHi) return -1;
if (this.hitEOF_)
return -1;
// slow path -- read another buffer
this.seek(this.curr_);
if (this.curr_ == this.hi_)
return -1;
}
byte res = this.buff_[(int) (this.curr_ - this.lo_)];
this.curr_++;
return ((int) res) & 0xFF; // convert byte -> int
}
public int read(byte[] b) throws IOException
{
return this.read(b, 0, b.length);
}
public int read(byte[] b, int off, int len) throws IOException
{
if (this.curr_ >= this.hi_)
{
// test for EOF
// if (this.hi < this.maxHi) return -1;
if (this.hitEOF_)
return -1;
// slow path -- read another buffer
this.seek(this.curr_);
if (this.curr_ == this.hi_)
return -1;
}
len = Math.min(len, (int) (this.hi_ - this.curr_));
int buffOff = (int) (this.curr_ - this.lo_);
System.arraycopy(this.buff_, buffOff, b, off, len);
this.curr_ += len;
return len;
}
public void write(int b) throws IOException
{
if (this.curr_ >= this.hi_)
{
if (this.hitEOF_ && this.hi_ < this.maxHi_)
{
// at EOF -- bump "hi"
this.hi_++;
}
else
{
// slow path -- write current buffer; read next one
this.seek(this.curr_);
if (this.curr_ == this.hi_)
{
// appending to EOF -- bump "hi"
this.hi_++;
}
}
}
this.buff_[(int) (this.curr_ - this.lo_)] = (byte) b;
this.curr_++;
this.dirty_ = true;
syncNeeded_ = true;
}
public void write(byte[] b) throws IOException
{
this.write(b, 0, b.length);
}
public void write(byte[] b, int off, int len) throws IOException
{
while (len > 0)
{
int n = this.writeAtMost(b, off, len);
off += n;
len -= n;
this.dirty_ = true;
syncNeeded_ = true;
}
}
/*
* Write at most "len" bytes to "b" starting at position "off", and return
* the number of bytes written.
*/
private int writeAtMost(byte[] b, int off, int len) throws IOException
{
if (this.curr_ >= this.hi_)
{
if (this.hitEOF_ && this.hi_ < this.maxHi_)
{
// at EOF -- bump "hi"
this.hi_ = this.maxHi_;
}
else
{
// slow path -- write current buffer; read next one
this.seek(this.curr_);
if (this.curr_ == this.hi_)
{
// appending to EOF -- bump "hi"
this.hi_ = this.maxHi_;
}
}
}
len = Math.min(len, (int) (this.hi_ - this.curr_));
int buffOff = (int) (this.curr_ - this.lo_);
System.arraycopy(b, off, this.buff_, buffOff, len);
this.curr_ += len;
return len;
}
}
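Then simply replace the RandomAccessFile used above with BufferedRandomAccessFile.
Let's test it.
With the earlier single-threaded method:
TestReadFile.readFileByBufferedRandomAccessFile(pathname2);
6.5 GB took 32 seconds.
Compared with before, when the read never finished, this is much better, but it is still slower than NIO.
Now the multithreaded version:
6.5 GB: 2 threads took 20 seconds, 3 threads 16 seconds, 4 threads 14 seconds, 5 threads 11 seconds, 6 threads 8 seconds, 7 threads 8 seconds, 8 threads 9 seconds.
My Mac has a 6-core processor, so performance peaks at 6 threads. With more threads than that, time is lost to thread context switching, so the elapsed time starts to climb again. Even so, this still does not seem to match the split-file version above (4 seconds).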
2.3 Multithreading, version 3.0
Since JDK 1.4, most of what RandomAccessFile does has been superseded by NIO's memory-mapped files (MappedByteBuffer). You can try that yourself; this article does not cover it in detail.
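As a starting point for that experiment, here is a minimal sketch of reading a file through MappedByteBuffer. It is not the author's implementation: the class MappedReadDemo, the method countNewlines and the 256 MB window size are assumptions made for this example. A single mapping cannot exceed Integer.MAX_VALUE bytes, so a large file is mapped one window at a time.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class for memory-mapped reading.
final class MappedReadDemo {

    // Size of each mapped window; an arbitrary choice for this sketch.
    static final long WINDOW = 256L * 1024 * 1024;

    // Counts '\n' bytes by mapping the file window by window.
    static long countNewlines(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long count = 0;
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        count++;
                    }
                }
            }
            return count;
        }
    }
}
```

To combine this with the multithreaded approach, each thread could map and scan only its own begin/end range instead of calling seek and readLine.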
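3. Wuxian's summary
Master: This article was a bit hard, a bit tough on the eyes and on the brain, so today I will summarize it for you.
Disciple: Master, it was too hard for me. I almost fell asleep listening.
Master: File handling is complicated by nature, and not everyone on a project ever writes I/O stream code.
Here is the summary. We covered two main points.
(1) First: reading a large file by splitting it, combined with NIO, gives a speed-up. The core idea: use the split command on Mac/Linux to cut the large file into several smaller files, then read each small file in its own thread. 13.56 GB split into 6 files took 8 seconds; 26 GB took 16 seconds. At that rate, reading 100 GB would take about a minute. Of course the real time depends heavily on how you process the data; if you print it with System.out, for example, it will take far longer.
(2) Second: reading a large file with multiple threads. The core idea: divide the file into n segments by length, then start several threads that each use the seek method of RandomAccessFile to jump straight to their segment and read from there. 13.56 GB with 6 threads took 23 seconds.
In addition, single-threaded reading with NIO's FileChannel is actually quite fast too: 13.56 GB took 15 seconds. As mentioned before, Java handles large files naturally. That is Java: not just Write once, but Write happy.
Finally, keep in mind that a ByteBuffer read returns a chunk that spans many lines, not one line at a time.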
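One way to deal with that, sketched under assumptions rather than taken from the article: read the file chunk by chunk through a FileChannel, cut each chunk at '\n', and carry the unfinished tail of one chunk over to the next. The class ChunkLineReader and its method forEachLine are hypothetical names; the sketch assumes UTF-8 text and a positive bufferSize.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

// Hypothetical helper that turns ByteBuffer chunks back into lines.
final class ChunkLineReader {

    // Calls handler once per line; a line may span several chunks.
    static void forEachLine(Path file, int bufferSize, Consumer<String> handler)
            throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(bufferSize);
            // Bytes of the line that is still incomplete.
            ByteArrayOutputStream pending = new ByteArrayOutputStream();
            while (channel.read(buffer) != -1) {
                buffer.flip();
                while (buffer.hasRemaining()) {
                    byte b = buffer.get();
                    if (b == '\n') {
                        handler.accept(decode(pending));
                        pending.reset();
                    } else {
                        pending.write(b);
                    }
                }
                buffer.clear();
            }
            // Last line without a trailing newline.
            if (pending.size() > 0) {
                handler.accept(decode(pending));
            }
        }
    }

    // Decodes a complete line and drops the '\r' of a "\r\n" terminator.
    private static String decode(ByteArrayOutputStream line) {
        String s = new String(line.toByteArray(), StandardCharsets.UTF_8);
        return s.endsWith("\r") ? s.substring(0, s.length() - 1) : s;
    }
}
```

Decoding only complete lines also avoids breaking a multi-byte UTF-8 character that happens to straddle two chunks.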
I am who I am, a firework of a different color.
I am who I am, a little apple unlike any other.