Reading and writing files with Python pandas: reading files with read_csv()

Reading files with read_csv()

1. Several ways to read files in pandas

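  • read_csv loads delimited data from a file, URL, or file-like object. The default delimiter is a comma.
  • read_table loads delimited data from a file, URL, or file-like object. The default delimiter is a tab ("\t").
  • read_fwf reads data in fixed-width column format (that is, with no delimiter).
  • read_clipboard reads data from the clipboard; it can be thought of as the clipboard version of read_table. It is useful when converting web pages into tables.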
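
To make the difference between these readers concrete, here is a minimal sketch that parses the same small table three ways. The sample strings and the column widths passed to read_fwf are made up for illustration, and io.StringIO stands in for a real file; read_clipboard is left out because it depends on the system clipboard.

```python
import io
import pandas as pd

# Hypothetical in-memory samples standing in for real files.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"        # comma-separated
tsv_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"  # tab-separated
fwf_text = "a  b  c\n1  2  3\n4  5  6\n"  # fixed-width columns, no delimiter

df_csv = pd.read_csv(io.StringIO(csv_text))    # default separator is ','
df_tsv = pd.read_table(io.StringIO(tsv_text))  # default separator is '\t'
# widths gives the character width of each column (assumed here: 3, 3, 1)
df_fwf = pd.read_fwf(io.StringIO(fwf_text), widths=[3, 3, 1])

print(df_csv)
print(df_tsv)
print(df_fwf)
```

All three calls should produce the same three-column table, since only the way the fields are delimited differs.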

2. A simple example of reading files

Code:

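import pandas as pd

# A CSV file with a header row; read_table needs sep=',' to read the same file
df = pd.read_csv('D:/project/python_instruct/test_data1.csv')
print('CSV file read with read_csv:', df)
df = pd.read_table('D:/project/python_instruct/test_data1.csv', sep=',')
print('CSV file read with read_table:', df)

# A CSV file without a header row: use default integer names, or supply names
df = pd.read_csv('D:/project/python_instruct/test_data2.csv', header=None)
print('CSV file without a header row, read with read_csv:', df)
df = pd.read_csv('D:/project/python_instruct/test_data2.csv', names=['a', 'b', 'c', 'd', 'message'])
print('CSV file read with read_csv and custom column names:', df)

# Use the 'message' column as the row index
names = ['a', 'b', 'c', 'd', 'message']
df = pd.read_csv('D:/project/python_instruct/test_data2.csv', names=names, index_col='message')
print('Index column specified in read_csv:', df)

# Build a hierarchical index from several columns
parsed = pd.read_csv('D:/project/python_instruct/test_data3.csv', index_col=['key1', 'key2'])
print('read_csv building a hierarchical index from several columns:')
print(parsed)

# Fields separated by a variable amount of whitespace: use a regular expression
with open('D:/project/python_instruct/test_data1.txt') as f:
    print(list(f))
result = pd.read_table('D:/project/python_instruct/test_data1.txt', sep=r'\s+')
print('read_table splitting fields with a regular expression:')
print(result)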

Output:

CSV file read with read_csv:    a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
CSV file read with read_table:    a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
CSV file without a header row, read with read_csv:    0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
CSV file read with read_csv and custom column names:    a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
Index column specified in read_csv:           a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
read_csv building a hierarchical index from several columns:
           value1  value2
key1 key2                
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
['      A     B    C \n', 'aaa -0.26 -0.1 -0.4\n', 'bbb -0.92 -0.4 -0.7\n', 'ccc -0.34 -0.5 -0.8\n', 'ddd -0.78 -0.3 -0.2']
read_table splitting fields with a regular expression:
        A    B    C
aaa -0.26 -0.1 -0.4
bbb -0.92 -0.4 -0.7
ccc -0.34 -0.5 -0.8
ddd -0.78 -0.3 -0.2
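
In the last result, the labels aaa to ddd became the row index even though no index_col was given. This is because the header line has one fewer field than the data rows, in which case pandas treats the first column as the index. The sketch below reproduces this with an in-memory copy of the lines printed above; io.StringIO stands in for test_data1.txt.

```python
import io
import pandas as pd

# In-memory copy of the file contents printed above: the header names three
# columns, but every data row has four whitespace-separated fields.
text = (
    "      A     B    C \n"
    "aaa -0.26 -0.1 -0.4\n"
    "bbb -0.92 -0.4 -0.7\n"
    "ccc -0.34 -0.5 -0.8\n"
    "ddd -0.78 -0.3 -0.2"
)

# r'\s+' splits on any run of whitespace
result = pd.read_table(io.StringIO(text), sep=r'\s+')

print(result.index.tolist())    # row labels taken from the first field
print(result.columns.tolist())  # column names taken from the header line
```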

3. Reading large datasets in chunks

First, look at the code:

# Raw string so the backslashes in the Windows path are not treated as escapes
result = pd.read_csv(r'D:\project\python_instruct\weibo_network.txt')
print('Original file:', result)

Output:

Traceback (most recent call last):

  File "", line 1, in 
    runfile('D:/project/python_instruct/Test.py', wdir='D:/project/python_instruct')

  File "D:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "D:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "D:/project/python_instruct/Test.py", line 75, in 
    reslt=pd.read_csv('D:\project\python_instruct\weibo_network.txt')

  File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 325, in _read
    return parser.read()

  File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 815, in read
    ret = self._engine.read(nrows)

  File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1314, in read
    data = self._reader.read(nrows)

  File "pandas\parser.pyx", line 805, in pandas.parser.TextReader.read (pandas\parser.c:8748)

  File "pandas\parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:9003)

  File "pandas\parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas\parser.c:9731)

  File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:9602)

  File "pandas\parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas\parser.c:23325)

CParserError: Error tokenizing data. C error: out of memory

The dataset turns out to be too large to fit in memory. We can read just a few rows to take a look, for example the first 10:

# nrows limits how many rows are read from the start of the file
result = pd.read_csv(r'D:\project\python_instruct\weibo_network.txt', nrows=10)
print('Reading only a few rows:')
print(result)

Output:

0  0\t296\t3\t1\t10\t1\t12\t1\t13\t1\t14\t1\t16\t...
1  1\t271\t8\t1\t17\t1\t22\t1\t31\t0\t34\t1\t6742...
2  2\t158\t0\t0\t5\t1\t10\t1\t11\t1\t13\t1\t16\t0...
3  3\t413\t0\t1\t5\t1\t194\t1\t354\t1\t3462\t1\t8...
4  4\t142\t1\t0\t5\t1\t7\t1\t11\t1\t14\t1\t18\t1\...
5  5\t272\t2\t1\t3\t1\t4\t1\t12\t1\t13\t1\t14\t1\...
6  6\t59\t9\t1\t13\t1\t46991\t0\t66930\t0\t85672\...
7  7\t131\t4\t1\t11\t1\t20\t1\t24\t1\t26\t0\t30\t...
8  8\t326\t0\t0\t1\t1\t12\t1\t13\t1\t17\t1\t19\t1...
9  9\t12\t0\t0\t6\t1\t10\t1\t13\t1\t18\t0\t466527...
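
The literal \t sequences in this output suggest that the file is tab-separated, so each line was read as a single comma-delimited field. Beyond nrows, read_csv also accepts a chunksize argument, which returns an iterator that yields one DataFrame per chunk so the whole file never has to sit in memory at once. The following is only a minimal sketch: the sample data, the column names, and the chunk size are invented for illustration, and the real weibo_network.txt may need extra handling (its rows appear to contain differing numbers of fields).

```python
import io
import pandas as pd

# Hypothetical tab-separated sample standing in for a file too large to load.
sample = "\n".join(f"{i}\t{i * 2}\t{i % 3}" for i in range(10)) + "\n"

total_rows = 0
value_sum = 0

# With chunksize set, read_csv yields DataFrames of at most 4 rows each
reader = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                     names=["id", "value", "flag"],  # assumed column names
                     chunksize=4)
for chunk in reader:
    # Process each piece, keeping only small running aggregates
    total_rows += len(chunk)
    value_sum += int(chunk["value"].sum())

print(total_rows, value_sum)
```

Only one chunk is held in memory at a time, so the pattern works as long as the per-chunk processing reduces the data to something small, such as counts or sums.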

 
