The code below reads the CSV data file from the "DaGuan Cup" (达观杯) competition. Source: "Loading big data: with a cute read progress bar"
import time
import pandas as pd
from tqdm import tqdm

# @execution_time
def reader_pandas(file, chunkSize=100000, patitions=10 ** 4):
    # iterator=True returns a TextFileReader instead of loading the whole file
    reader = pd.read_csv(file, iterator=True)
    chunks = []
    with tqdm(range(patitions), 'Reading ...') as t:
        for _ in t:
            try:
                # Read up to chunkSize rows; StopIteration marks the end of the file
                chunk = reader.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                break
    # Stitch all chunks back into a single DataFrame
    return pd.concat(chunks, ignore_index=True)

print(reader_pandas("./data/train_set.csv"))
Output:
D:\software\Anaconda3\python.exe D:/Competitions/DaGuanBei/test.py
Reading ...: 0%| | 2/10000 [00:41<79:10:31, 28.51s/it]
id ... class
0 0 ... 14
1 1 ... 3
2 2 ... 12
3 3 ... 13
4 4 ... 12
5 5 ... 13
6 6 ... 1
7 7 ... 10
8 8 ... 10
9 9 ... 19
10 10 ... 18
11 11 ... 7
12 12 ... 9
13 13 ... 4
14 14 ... 17
15 15 ... 9
16 16 ... 13
17 17 ... 10
18 18 ... 10
19 19 ... 14
20 20 ... 10
21 21 ... 9
22 22 ... 1
23 23 ... 2
24 24 ... 13
25 25 ... 1
26 26 ... 7
27 27 ... 17
28 28 ... 10
29 29 ... 8
... ... ... ...
102247 102247 ... 9
102248 102248 ... 18
102249 102249 ... 13
102250 102250 ... 9
102251 102251 ... 1
102252 102252 ... 14
102253 102253 ... 12
102254 102254 ... 11
102255 102255 ... 19
102256 102256 ... 2
102257 102257 ... 4
102258 102258 ... 3
102259 102259 ... 6
102260 102260 ... 9
102261 102261 ... 1
102262 102262 ... 18
102263 102263 ... 6
102264 102264 ... 8
102265 102265 ... 16
102266 102266 ... 18
102267 102267 ... 15
102268 102268 ... 3
102269 102269 ... 3
102270 102270 ... 3
102271 102271 ... 8
102272 102272 ... 14
102273 102273 ... 8
102274 102274 ... 12
102275 102275 ... 4
102276 102276 ... 11
[102277 rows x 4 columns]
Process finished with exit code 0
The code above uses pandas' read_csv(). Its default parameter sep=',' sets the delimiter to a comma, which matches the comma-separated layout of the CSV file.
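As a small illustration of the sep default, the sketch below parses the same rows once as comma-separated text and once as tab-separated text. The in-memory strings and variable names are made up for this example and are not part of the competition data.

```python
import io

import pandas as pd

# Hypothetical in-memory data, used only for illustration.
comma_text = "id,class\n0,14\n1,3\n"
tab_text = "id\tclass\n0\t14\n1\t3\n"

# With no sep argument, read_csv splits fields on commas.
df_comma = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter has to be passed explicitly.
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Both calls should yield the same two-column frame.
print(df_comma.equals(df_tab))
```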
iterator : boolean, default False
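Returns a TextFileReader object so that the file can be processed chunk by chunk.
iterator=True means the file is read chunk by chunk rather than all at once.
reader.get_chunk(chunkSize) reads the next chunk of at most chunkSize rows on each call.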
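To see how iterator=True and get_chunk behave on something small, here is a minimal sketch that reads a five-row CSV held in memory, two rows at a time. The data and the names csv_text and sizes are assumptions made for this example. The last chunk is shorter than the requested size, and the call after it raises StopIteration, which is what the loop in reader_pandas relies on.

```python
import io

import pandas as pd

# Hypothetical five-row CSV held in memory, used only for illustration.
csv_text = "id,class\n" + "".join(f"{i},{i % 3}\n" for i in range(5))

# iterator=True returns a TextFileReader instead of a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)

sizes = []
while True:
    try:
        # Ask for at most 2 rows per call.
        chunk = reader.get_chunk(2)
    except StopIteration:
        # Raised once the input is exhausted.
        break
    sizes.append(len(chunk))

print(sizes)
```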
The tqdm module prints the progress bar while the file is being read; see the references for details.
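In the output above the bar stops at 2/10000, because its total is the patitions upper bound rather than the real amount of work. One possible variation, sketched here under the assumption that the number of data rows is known or can be counted beforehand, is to give tqdm the row count as total and advance the bar by the size of each chunk. This sketch uses the chunksize parameter of read_csv instead of get_chunk, and the in-memory data is made up for the example.

```python
import io

import pandas as pd
from tqdm import tqdm

# Hypothetical ten-row CSV held in memory, used only for illustration.
csv_text = "id,class\n" + "".join(f"{i},{i % 3}\n" for i in range(10))

# Assumption: the row count is known up front (here: newlines minus the header).
total_rows = csv_text.count("\n") - 1

chunks = []
with tqdm(total=total_rows, desc="Reading ...", unit="rows") as bar:
    # chunksize makes read_csv yield DataFrames of at most 4 rows each.
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
        chunks.append(chunk)
        # Advance the bar by the number of rows actually read.
        bar.update(len(chunk))

df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```

For a file on disk, counting the rows first costs an extra pass over the data, so whether this is worth it depends on how large the file is.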
References:
pandas.read_csv parameters explained in detail
pandas.read_csv: reading large files in chunks
Python's tqdm module