The Linux split command splits a file into pieces. Its main options are:
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file (this option splits by size while keeping each line of data intact)
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic to standard error just before each output file is opened
--help display this help and exit
--version output version information and exit
SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.
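To make the difference between -b and -C more concrete, here is a minimal in-memory sketch of how -C packs lines. It is an approximation for illustration, not the actual implementation of split; the function name split_line_bytes is hypothetical, and the handling of lines longer than SIZE is an assumption that may differ between split versions.

```python
import io


def split_line_bytes(data, size):
    """Greedy sketch of `split -C size`: pack whole lines into chunks of at most `size` bytes."""
    chunks = []
    current = b""
    # Iterating a BytesIO yields lines terminated by b"\n" (terminator kept).
    for line in io.BytesIO(data):
        # A line longer than the limit cannot be kept whole; cut it into size-byte pieces.
        while len(line) > size:
            if current:
                chunks.append(current)
                current = b""
            chunks.append(line[:size])
            line = line[size:]
        # Start a new chunk when the next line would overflow the current one.
        if len(current) + len(line) > size:
            chunks.append(current)
            current = b""
        current += line
    if current:
        chunks.append(current)
    return chunks


data = b"aaaa\nbb\ncccccc\nd\n"
pieces = split_line_bytes(data, 8)
# No line is cut in half, and joining the pieces restores the input.
assert b"".join(pieces) == data
```

Because lines are never broken, a piece can be smaller than SIZE. The number of pieces is therefore not always the file size divided by SIZE, rounded up.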
A simple Python script can split a given file into several files of roughly equal size and line count.
The code is as follows:
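import os

def _split_file(filepath, theFileNumber, prefix=""):
    # NOTE: the original script never defines prefix; it is assumed here to be
    # an optional string prepended to the output file names.
    filesize = __file_size(filepath)
    slavelength = theFileNumber
    # Pad the per-file size by 1000 bytes. split -C never breaks a line, so
    # pieces can come out smaller than the limit; the slack makes it less
    # likely that leftover data spills into an extra file.
    splitsize = filesize // slavelength + 1000
    command = "split -C %d %s %s%s" % (splitsize, filepath, prefix, os.path.basename(filepath))
    # The actual work is done by the Linux split command
    print(command)
    os.system(command)

def __file_size(filepath):
    statinfo = os.stat(filepath)
    return statinfo.st_size

if __name__ == "__main__":
    dataFilePath = "/data/big_file.log"
    _split_file(dataFilePath, 4)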
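The script above builds a shell command string, which breaks if the path contains spaces. As one possible variant, the sketch below keeps the same arithmetic but passes the arguments as a list to subprocess.run. The names build_split_command, split_file and slack are hypothetical, and the prefix parameter is an assumption, since the original script does not say where prefix comes from.

```python
import os
import subprocess


def build_split_command(filepath, pieces, prefix="", slack=1000):
    """Return the argv list for `split -C`, sized for roughly `pieces` output files."""
    # Same arithmetic as the script above: an equal share of the file plus some slack.
    chunk_size = os.path.getsize(filepath) // pieces + slack
    # split appends its own suffixes (aa, ab, ...) to this output prefix.
    out_prefix = prefix + os.path.basename(filepath)
    return ["split", "-C", str(chunk_size), filepath, out_prefix]


def split_file(filepath, pieces, prefix=""):
    """Run split without a shell; raises CalledProcessError if split fails."""
    subprocess.run(build_split_command(filepath, pieces, prefix), check=True)


# Example (requires the split binary and an existing file):
# split_file("/data/big_file.log", 4)
```

The 1000-byte slack only helps when lines are short compared with the chunk size. With very long lines, split may still produce more than the requested number of files.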