The pd.DataFrame() function

1 Introduction to DataFrame

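A DataFrame represents a table, a spreadsheet-like data structure that contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share one index. Compared with other DataFrame-like structures you may have used before (such as R's data.frame), row-oriented and column-oriented operations in a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional arrays rather than as a list, dict, or some other collection of one-dimensional arrays.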

DataFrame([data, index, columns, dtype, copy])	
# Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
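
The "dict of Series sharing one index" view described above can be sketched as follows. This is a minimal illustration, not part of the original post: the names population, is_coastal and frame, and the sample values, are made up for the example.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data)
population = pd.Series([1.5, 2.4], index=['Ohio', 'Nevada'])
is_coastal = pd.Series([False, False], index=['Ohio', 'Nevada'])

# Each Series becomes one column; the shared index becomes the row index
frame = pd.DataFrame({'pop': population, 'coastal': is_coastal})

print(frame.dtypes)   # each column keeps its own dtype
print(frame['pop'])   # selecting one column gives back a Series
```

Because each column is stored with its own dtype, a float column and a boolean column can sit side by side in the same table.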

2 Creating a DataFrame

import pandas as pd
import numpy as np
  1. Create from a dictionary
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, np.nan],  # np.nan marks a missing value (NA)
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
pd.DataFrame(data,
#            index=['a', 'b', 'c', 'd', 'e']
#            index=range(5)
             )  # an integer index is generated by default; dict keys become column labels and each value list fills that column

Output:


    state	year	pop
0	Ohio	2000.0	1.5
1	Ohio	2001.0	1.7
2	Ohio	2002.0	3.6
3	Nevada	2001.0	2.4
4	Nevada	NaN	2.9
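
The year column above prints as 2000.0 rather than 2000 because the np.nan entry forces the whole column to a floating-point dtype. The sketch below shows this on a shortened version of the same data, and also passes the index and columns parameters from the constructor signature; the variable name frame and the chosen labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd

data = {'state': ['Ohio', 'Nevada'],
        'year': [2000, np.nan],   # one missing value
        'pop': [1.5, 2.9]}

# index sets the row labels; columns selects and orders the dict keys
frame = pd.DataFrame(data, index=['a', 'b'], columns=['year', 'state', 'pop'])

print(frame.dtypes)           # 'year' is float64 because of the NaN
print(frame['year'].isna())   # marks which rows are missing
```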
  2. Create a DataFrame with the pd.DataFrame.from_dict method
# a dict nested two levels deep
d = {'a': {'tp': 26, 'fp': 112},
     'b': {'tp': 26, 'fp': 91},
     'c': {'tp': 23, 'fp': 74}}
df_index = pd.DataFrame.from_dict(d, orient='index')
df_index

Output:

	tp	fp
a	26	112
b	26	91
c	23	74
df_columns = pd.DataFrame.from_dict(d, orient='columns')
df_columns

Output:

	a	b	c
fp	112	91	74
tp	26	26	23
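
The two orient values produce transposed layouts of the same data: with orient='index' the outer keys become row labels, and with orient='columns' they become column labels. A small sketch on a shortened version of the dict above (the names by_index and by_columns are illustrative):

```python
import pandas as pd

d = {'a': {'tp': 26, 'fp': 112},
     'b': {'tp': 26, 'fp': 91}}

by_index = pd.DataFrame.from_dict(d, orient='index')      # outer keys -> rows
by_columns = pd.DataFrame.from_dict(d, orient='columns')  # outer keys -> columns

# The same value is addressed with row and column labels swapped
print(by_index.loc['a', 'fp'])
print(by_columns.loc['fp', 'a'])
```

The order in which the inner keys (tp, fp) appear can differ between pandas versions, so it is safer to select by label than by position.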
  3. Create a DataFrame by passing a NumPy array, an index (which can be a datetime index), and column labels
data = pd.DataFrame(np.arange(10, 26).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Output:

	one	two	three	four
Ohio	10	11	12	13
Colorado	14	15	16	17
Utah	18	19	20	21
New York	22	23	24	25

Generate a DataFrame named df with a datetime index:

np.random.seed(10)
dates = pd.date_range('20190101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

	A	B	C	D
2019-01-01	1.331587	0.715279	-1.545400	-0.008384
2019-01-02	0.621336	-0.720086	0.265512	0.108549
2019-01-03	0.004291	-0.174600	0.433026	1.203037
2019-01-04	-0.965066	1.028274	0.228630	0.445138
2019-01-05	-1.136602	0.135137	1.484537	-1.079805
2019-01-06	-1.977728	-1.743372	0.266070	2.384967
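
When a DataFrame is built from a 2-D array, the lengths of index and columns have to match the array's shape. The following sketch uses a small made-up array (values, dates and frame are illustrative names) and shows that a mismatch raises a ValueError:

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape((3, 2))          # 3 rows, 2 columns
dates = pd.date_range('20190101', periods=3)   # 3 row labels

frame = pd.DataFrame(values, index=dates, columns=['x', 'y'])
print(frame.shape)

# Three column labels for a two-column array: pandas rejects this
try:
    pd.DataFrame(values, index=dates, columns=['x', 'y', 'z'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)
```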

3 Basic DataFrame attributes

  • DataFrame.index: The index (row labels) of the DataFrame.
df.index

Output:

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
  • Set the index name
df.index.name = 'time'
  • DataFrame.columns: The column labels of the DataFrame.
df.columns

Output:

Index(['A', 'B', 'C', 'D'], dtype='object')
  • Set the name of the column index
df.columns.name = 'alphabet'
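  • DataFrame.values: Return a NumPy representation of the DataFrame.
  • View the underlying NumPy data
df.values  # the array printed below covers only the last three rows (2019-01-04 to 2019-01-06)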
array([[-0.96506567,  1.02827408,  0.22863013,  0.44513761],
       [-1.13660221,  0.13513688,  1.484537  , -1.07980489],
       [-1.97772828, -1.7433723 ,  0.26607016,  2.38496733]])
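
The attributes in this section can be tried together on a small frame. This is a sketch with made-up data; frame and arr are illustrative names. It also uses DataFrame.to_numpy(), which does not appear in the original text and is available in pandas 0.24 and later as an alternative to .values.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((3, 2)),
                     index=pd.date_range('20190101', periods=3),
                     columns=['A', 'B'])

# Naming the two axes changes only how they are labeled, not the data
frame.index.name = 'time'
frame.columns.name = 'alphabet'

arr = frame.to_numpy()   # same contents as frame.values for this frame
print(frame.index.name, frame.columns.name)
print(arr.shape)
```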

4 DataFrame indexing

  • DataFrame.head(self[, n]) Return the first n rows.
df.head(3)  # show the first three rows

	A	B	C	D
2019-01-01	1.331587	0.715279	-1.545400	-0.008384
2019-01-02	0.621336	-0.720086	0.265512	0.108549
2019-01-03	0.004291	-0.174600	0.433026	1.203037
  • DataFrame.tail(self[, n]) Return the last n rows.
df.tail(3)  # show the last three rows
	A	B	C	D
2019-01-04	-0.965066	1.028274	0.228630	0.445138
2019-01-05	-1.136602	0.135137	1.484537	-1.079805
2019-01-06	-1.977728	-1.743372	0.266070	2.384967
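
A short sketch of how the n argument behaves, on a made-up ten-row frame (the variable names are illustrative). When n is omitted it defaults to 5, and a negative n means "everything except that many rows from the other end".

```python
import pandas as pd

frame = pd.DataFrame({'n': range(10)})

first = frame.head()                  # n defaults to 5
last_two = frame.tail(2)              # the final two rows
all_but_last_three = frame.head(-3)   # drop the last three rows

print(len(first), len(last_two), len(all_but_last_three))
```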
  • DataFrame.set_index(self, keys[, drop, …]) Set the DataFrame index using existing columns.
df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two',
                         'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
df

	a	b	c	d
0	0	7	one	0
1	1	6	one	1
2	2	5	one	2
3	3	4	two	0
4	4	3	two	1
5	5	2	two	2
6	6	1	two	3
# set_index turns one or more columns of the DataFrame into the row index
df2 = df.set_index(['c', 'd'])
df2

		a	b
 c	d		
one	0	0	7
	1	1	6
	2	2	5
two	0	3	4
	1	4	3
	2	5	2
	3	6	1
  • drop defaults to True; with drop=False the columns used for the index are also kept as regular columns
df.set_index(['c', 'd'], drop=False)
		a	b	c	d
c	d				
one	0	0	7	one	0
	1	1	6	one	1
	2	2	5	one	2	
two	0	3	4	two	0
	1	4	3	two	1
	2	5	2	two	2
	3	6	1	two	3

  • reset_index does the opposite of set_index: the levels of the hierarchical index are moved back into columns

df2.reset_index()
	c	d	a	b
0	one	0	0	7
1	one	1	1	6
2	one	2	2	5
3	two	0	3	4
4	two	1	4	3
5	two	2	5	2
6	two	3	6	1
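
The round trip between set_index and reset_index, and the label-based selection that a hierarchical index allows, can be sketched on a smaller frame. The data and the names frame, indexed and restored are illustrative assumptions.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'c': ['one', 'one', 'two', 'two'],
                      'd': [0, 1, 0, 1]})

# Columns c and d become a two-level row index
indexed = frame.set_index(['c', 'd'])
print(indexed.loc['one'])             # all rows under the outer label 'one'
print(indexed.loc[('two', 1), 'a'])   # one cell, addressed by both levels

# reset_index moves the index levels back into ordinary columns
restored = indexed.reset_index()
print(list(restored.columns))
```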

5 DataFrame computation and descriptive statistics

  • DataFrame.round(self[, decimals]) Round a DataFrame to a variable number of decimal places.
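  • Round the displayed numbers to two decimal places
df.round(2)  # df here is the date-indexed DataFrame generated in section 2
	A	B	C	D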
2019-01-01	1.33	0.72	-1.55	-0.01
2019-01-02	0.62	-0.72	0.27	0.11
2019-01-03	0.00	-0.17	0.43	1.20
2019-01-04	-0.97	1.03	0.23	0.45
2019-01-05	-1.14	0.14	1.48	-1.08
2019-01-06	-1.98	-1.74	0.27	2.38
  • Specify a different number of decimal places for different columns
df.round({'A': 1, 'C': 2})

	A	B	C	D
2019-01-01	 1.3	 0.715279	-1.55	-0.008384
2019-01-02	 0.6	-0.720086	0.27	0.108549
2019-01-03	 0.0	-0.174600	0.43	1.203037
2019-01-04	-1.0	 1.028274	0.23	0.445138
2019-01-05	-1.1	 0.135137	1.48	-1.079805
2019-01-06	-2.0	-1.743372	0.27	2.384967
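
Both forms of round return a new DataFrame and leave the original untouched; columns missing from the dict keep their full precision. A minimal sketch using two rows of values taken from the table above (the names frame, two_places and mixed are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.331587, -0.965066],
                      'B': [0.715279, 1.028274]})

two_places = frame.round(2)     # every column to 2 decimals
mixed = frame.round({'A': 1})   # only column A is rounded

print(two_places)
print(mixed)
print(frame)                    # the original values are unchanged
```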
  • DataFrame.describe(self[, percentiles, …]) Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
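# quick statistical summary of the numeric columns
df.describe()  # the output below was computed on only the last three rows of df (count = 3)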
alphabet	A	B	C	D
count	3.000000	3.000000	3.000000	3.000000
mean	-1.359799	-0.193320	0.659746	0.583433
std	0.541972	1.414715	0.714535	1.736521
min	-1.977728	-1.743372	0.228630	-1.079805
25%	-1.557165	-0.804118	0.247350	-0.317334
50%	-1.136602	0.135137	0.266070	0.445138
75%	-1.050834	0.581705	0.875304	1.415052
max	-0.965066	1.028274	1.484537	2.384967
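
As the description says, NaN values are excluded from the statistics. The sketch below shows this on a made-up single-column frame, and also passes the percentiles parameter from the signature; summary and custom are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0, 3.0, np.nan]})

summary = frame.describe()
print(summary)   # count is 3: the NaN row is not counted

# Request the 10th and 90th percentiles instead of the default quartiles
custom = frame.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```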
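  • DataFrame.apply(self, func[, axis, …]) Apply a function along an axis of the DataFrame.
df  # at this point df has been modified: the first-row values of A and B are 0, column D is 5 throughout, and a column F was added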
                A	        B	        C	    D	  F
2019-01-01	0.000000	0.000000	-1.545400	5	NaN
2019-01-02	0.621336	-0.720086	0.265512	5	1.0
2019-01-03	0.004291	-0.174600	0.433026	5	2.0
2019-01-04	-0.965066	1.028274	0.228630	5	3.0
2019-01-05	-1.136602	0.135137	1.484537	5	4.0
2019-01-06	-1.977728	-1.743372	0.266070	5	5.0
df.apply(np.cumsum, axis=0, result_type=None)  # cumulative sum down each column
				A			B			C		D	 F
2019-01-01	0.000000	0.000000	-1.545400	5	NaN
2019-01-02	0.621336	-0.720086	-1.279889	10	1.0
2019-01-03	0.625627	-0.894686	-0.846863	15	3.0
2019-01-04	-0.339438	0.133588	-0.618232	20	6.0
2019-01-05	-1.476040	0.268725	0.866305	25	10.0
2019-01-06	-3.453769	-1.474647	1.132375	30	15.0
df.apply(lambda x: x.max() - x.min())  # range (max minus min) of each column
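
The axis argument decides what the function receives: with the default axis=0 it is called once per column, and with axis=1 once per row. A small sketch with made-up integers (frame, col_range and row_range are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 4, 2], 'B': [10, 30, 20]})

# axis=0 (default): the function is applied to each column
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the function is applied to each row
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```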

6 Reindexing, selection, and label manipulation

  • DataFrame.rename(self[, mapper, index, …]) Alter axes labels.
  • Rename a column
df.rename(columns={'A': 'key2'}, inplace=False)  # returns a new DataFrame; df itself is unchanged
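
rename accepts either a dict or a function as the mapper, and it can relabel the row index as well as the columns. A sketch on a made-up frame (frame and renamed are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# A dict renames only the listed labels; a function is applied to every label
renamed = frame.rename(columns={'A': 'key2'}, index=str.upper)

print(renamed)
print(list(frame.columns))   # the original frame keeps its labels
```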

7 Sorting

  • DataFrame.sort_index(self[, axis, level, …]) Sort object by labels (along an axis).
# axis=0 by default: rows are sorted by their index labels; ascending=True (the default) sorts in ascending order
df.sort_index(axis=0, ascending=False)
# df.sort_index(axis=0, ascending=True)

	A	B	C	D
2019-01-06	-1.977728	-1.743372	0.266070	2.384967
2019-01-05	-1.136602	0.135137	1.484537	-1.079805
2019-01-04	-0.965066	1.028274	0.228630	0.445138
2019-01-03	0.004291	-0.174600	0.433026	1.203037
2019-01-02	0.621336	-0.720086	0.265512	0.108549
2019-01-01	1.331587	0.715279	-1.545400	-0.008384
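
sort_index can also sort the column labels when axis=1 is passed. A sketch on a made-up frame whose rows and columns both start out unsorted (all variable names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]}, index=[2, 0, 1])

rows_sorted = frame.sort_index()                 # rows by label, ascending
rows_desc = frame.sort_index(ascending=False)    # rows by label, descending
cols_sorted = frame.sort_index(axis=1)           # columns by label

print(rows_sorted)
print(rows_desc)
print(cols_sorted)
```

Each call returns a new DataFrame, and the data in every row stays attached to its label while the order changes.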
