CSV称为逗号分隔值,有时也称字符分隔值,其文件以纯文本形式储存表格数据
pd.read_csv(filepath_or_buffer,sep=',',*args,**kwargs)
"""
parameters:
filepath_or_buffer:文件地址或文件对象
sep:分隔符,默认为','
header:以整数表示该行作为列标签
skiprows:忽略某行,skiprows=0即忽略第一行数据
na_values:用于替换NA的值
index_col:以整数或字符串指定某一列或多列作为索引
nrows:指定需要读取的行数
names:当header=None时,以列表作为列标签
chunksize:指定文件块的大小,分块读入,得到TextFileReader对象可迭代
"""
读取Excel文件
pd.read_excel(sheet_name=0,header=0,names=None,**kwargs)
parameters:
sheet_name:指定读取表单,可以是索引0代表第一个表单,可以是表单名
读取excel返回的是DataFrame
data=pd.read_excel('test.xls')
print(type(data)) #
print(data.columns) # Index(['name', 'age', 'grade'], dtype='object')
print(data.iloc[0])
"""
name zx
age 20
grade 3
"""
print(data.iloc[0,0]) #zx
缺失数据为一个数据
None:空
np.nan:一个浮点数,用来标记缺失数据
a=np.array([1,np.nan,5,7,9])
print(a.dtype)
print(np.sum(a))
print(np.nansum(a))
结果:
dtype('float64')
nan
22.0
NaN:Not a Number,表示未定义或不可表示的数据;计算时,pd也会屏蔽NaN
s=pd.Series([1,np.nan,3,None,7,9])
print(s)
print(s.sum())
print(s.dropna())
#删除所有NaN数据,并返回不含NaN的新对象
结果:
0 1.0
1 NaN
2 3.0
3 NaN
4 7.0
5 9.0
dtype: float64
20.0
0 1.0
2 3.0
4 7.0
5 9.0
dtype: float64
pandas有4个针对缺失数据处理的函数;
s.isnull()
s.notnull()
s.dropna(axis=0):沿着某axis删除带有缺失数据的行或列
s.fillna(value,method=None,axis=0):将value值覆盖数据中的缺失值
pd.concat(objs,axis=0,join='outer',join_axes=None,ignore_index=False,keys=None,levels=None,names=None,verify_integrity=False,copy=False)
"""
objs : a sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the `keys` argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
The axis to concatenate along.
join : {'inner', 'outer'}, default 'outer' 对索引取交或者并集,默认为并集outer
How to handle indexes on other axis (or axes).
ignore_index : bool, default False 重置索引
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
keys : sequence, default None 设置多级索引
If multiple levels passed, should contain tuples. Construct
hierarchical index using the passed keys as the outermost level.
levels : list of sequences, default None
Specific levels (unique values) to use for constructing a
MultiIndex. Otherwise they will be inferred from the keys.
names : list, default None 为标签索引,列索引取名
Names for the levels in the resulting hierarchical index.
verify_integrity : bool, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
sort : bool, default False
Sort non-concatenation axis if it is not already aligned when `join` is 'outer'.
This has no effect when ``join='inner'``, which already preserves the order of the non-concatenation axis.
"""
总结:
objs:序列,参与连接的对象,对象为字典字典的键为二级索引
axis:指定沿着哪个轴连接,默认为0
keys:用于设置多级索引
names:多级索引的名字
join:默认为'outer',其他轴索引按并集连接;'inner',其他轴索引按交集连接
ignore_index:默认为'None',忽略原有索引
举例:
df1=pd.DataFrame([[110,120,130],[210,220,230],[310,320,330]])
df2=pd.DataFrame([[11,12,13],[21,22,23],[31,32,33]])
print(pd.concat([df1,df2]))
df1.index=['a','c','d']
df2.index=['b','c','d']
print(pd.concat([df1,df2],axis=0,keys=['df1','df2'])) #多级索引
print(pd.concat([df1,df2],axis=1)) #'outer'并集索引
print(pd.concat([df1,df2],axis=1,join='inner')) #'inner' 交集索引
结果:
0 1 2
0 110 120 130
1 210 220 230
2 310 320 330
0 11 12 13
1 21 22 23
2 31 32 33
0 1 2
df1 a 110 120 130
c 210 220 230
d 310 320 330
df2 b 11 12 13
c 21 22 23
d 31 32 33
0 1 2 0 1 2
a 110.0 120.0 130.0 NaN NaN NaN
c 210.0 220.0 230.0 21.0 22.0 23.0
d 310.0 320.0 330.0 31.0 32.0 33.0
b NaN NaN NaN 11.0 12.0 13.0
0 1 2 0 1 2
c 210 220 230 21 22 23
d 310 320 330 31 32 33
'''
Series.append : Concatenate Series.
DataFrame.append : Concatenate DataFrames.
仅限于axis=0的连接,且不改变原有对象
'''
print(df1.append(df2))
'''
0 1 2
a 110 120 130
c 210 220 230
d 310 320 330
b 11 12 13
c 21 22 23
d 31 32 33
'''
pd.merge(left,right,how='inner',on=None,left_on=None,right_on=None,left_index=False,right_index=False,sort=False,suffixes=('_x','_y'),indicator=False)
'''
left、right:要合并的两个DataFrame对象
how:默认为'inner',可选参数'outer','left、right':以left/right的指定字段以并集方式合并数据
on:字符串或者列表,用于合并的字段名称,若为指定,默认两个DataFrame对象交集
indicator:注释每行数据的来源
left_on、right_on:从left、right所引用对象的索引为键进行合并
left_index/right_index:默认为False,如果为True,则以left/right所引用对象的索引为键合并数据
'''
merge和concat的区别:
merge操作是针对DataFrame对象取并集 合并
concat操作是针对索引不对数据进行处理 连接
df1.combine_first(df2)
df:DataFrame对象
用df2中数据补充df1
inf1=pd.DataFrame({"name":['zx','hack',np.nan],
"age":[np.nan,19,20]})
inf2=pd.DataFrame({"name":['zx','hack','mary','aaron'],
"age":[18,np.nan,21,20]})
print(inf1.combine_first(inf2))
行列转换
stack(), unstack()方法
dropnan参数默认为True,即不保留数据中的NaN
透视表
df.pivot(index=None,columns=None,values=None)
"""
Return reshaped DataFrame organized by given index / column values.
Reshape data (produce a "pivot" table) based on column values. Uses unique values from specified `index` / `columns` to form axes of the resulting DataFrame. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns. See the :ref:`User Guide ` for more on reshaping.
让所给的参数重组数据,不支持数据聚合
"""
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
print(df.pivot(index='foo',columns='bar',values='baz'))
print(df.pivot(index='foo',columns='bar')['baz'])#跟上一语句作用相同
print(df.pivot(index='foo',columns='bar',values=['baz','zoo']))
结果:
bar A B C
foo
one 1 2 3
two 4 5 6
bar A B C
foo
one 1 2 3
two 4 5 6
baz zoo
bar A B C A B C
foo
one 1 2 3 x y z
two 4 5 6 q w t
当出现同一位置两个不同值时会报错
df.pivot_table(index=None,columns=None,values=None,aggfunc='mean',dropnan=True)
"""
aggfunc : function, list of functions, dict, default numpy.mean
"""