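A Series is a one-dimensional, array-like object made up of a sequence of values (of any NumPy data type) and an associated sequence of data labels, called its index.
When printed, a Series shows the index on the left and the values on the right.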
import pandas as pd
import numpy as np
s = pd.Series([1,3,4,np.nan,8,4])  # np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
print(s)
0 1.0
1 3.0
2 4.0
3 NaN
4 8.0
5 4.0
dtype: float64
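The index does not have to be the default 0..n-1 range. The following is a minimal sketch, assuming a hypothetical Series named `prices` with made-up labels, that shows how the labels and the values can be inspected separately.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
prices = pd.Series([1.5, 3.0, np.nan], index=['a', 'b', 'c'])

print(prices.index.tolist())   # the data labels
print(prices.values)           # the underlying NumPy array
print(prices['b'])             # look up a value by its label
print(prices.isna().sum())     # number of missing values
```
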
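A DataFrame is a tabular data structure that holds an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that all share the same index.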
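The "dictionary of Series" view can be made concrete. The sketch below assumes three hypothetical Series (`num`, `txt`, `flag`) that share one index; it is only an illustration of the idea, not the only way to build a DataFrame.

```python
import pandas as pd

# Three hypothetical Series that share the same index
idx = ['x', 'y', 'z']
num = pd.Series([1, 2, 3], index=idx)
txt = pd.Series(['p', 'q', 'r'], index=idx)
flag = pd.Series([True, False, True], index=idx)

# Each dictionary key becomes a column label
frame = pd.DataFrame({'num': num, 'txt': txt, 'flag': flag})
print(frame)
print(frame.dtypes)         # every column keeps its own dtype
print(type(frame['num']))   # selecting one column gives back a Series
```
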
Creating a DataFrame by specifying the data, index, and columns:
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
print(dates)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
print(data)
print(data.shape)
print(data.values) # the values are a two-dimensional NumPy array
DatetimeIndex(['2019-10-28', '2019-10-29', '2019-10-30', '2019-10-31',
'2019-11-01'],
dtype='datetime64[ns]', freq='D')
A B C D
2019-10-28 0.787455 -1.041718 0.577362 0.889686
2019-10-29 -0.017887 0.630519 -0.550084 0.373894
2019-10-30 1.969060 -1.010220 -0.166625 -1.247600
2019-10-31 -2.103929 -0.359340 -1.139270 1.283892
2019-11-01 -2.059919 0.690421 0.228301 -0.961284
(5, 4)
[[ 0.7874552 -1.04171802 0.5773621 0.88968563]
[-0.01788734 0.63051917 -0.5500839 0.37389389]
[ 1.9690596 -1.01022027 -0.16662497 -1.2476003 ]
[-2.10392864 -0.35933967 -1.13927039 1.28389242]
[-2.05991921 0.69042131 0.22830095 -0.961284 ]]
Creating a DataFrame from a dictionary:
import pandas as pd
import numpy as np
d = {'A':[4,3,2,1],"B":pd.Timestamp('20191028'),'C':range(4),'D':np.arange(4)}
print(d)
df = pd.DataFrame(d)
print(df)
print('[df.dtypes]:\n',df.dtypes) # data type of each column
print('[df.A]:\n',df.A) # index and data of column A
print('[type(df.A)]:\n',type(df.A)) # column A is a Series
{'A': [4, 3, 2, 1], 'B': Timestamp('2019-10-28 00:00:00'), 'C': range(0, 4), 'D': array([0, 1, 2, 3])}
A B C D
0 4 2019-10-28 0 0
1 3 2019-10-28 1 1
2 2 2019-10-28 2 2
3 1 2019-10-28 3 3
[df.dtypes]:
A int64
B datetime64[ns]
C int64
D int32
dtype: object
[df.A]:
0 4
1 3
2 2
3 1
Name: A, dtype: int64
[type(df.A)]:
<class 'pandas.core.series.Series'>
import pandas as pd
import numpy as np
d = {'A':[4,3,2,1],"B":pd.Timestamp('20191028'),'C':range(4),'D':np.arange(4)}
df = pd.DataFrame(d)
print('Summary statistics:\n')
print(df.describe())
print('First five rows:\n')
print(df.head())
print('First two rows:\n')
print(df.head(2))
print('Last five rows:\n')
print(df.tail())
print('Last two rows:\n')
print(df.tail(2))
Summary statistics:
A C D
count 4.000000 4.000000 4.000000
mean 2.500000 1.500000 1.500000
std 1.290994 1.290994 1.290994
min 1.000000 0.000000 0.000000
25% 1.750000 0.750000 0.750000
50% 2.500000 1.500000 1.500000
75% 3.250000 2.250000 2.250000
max 4.000000 3.000000 3.000000
First five rows:
A B C D
0 4 2019-10-28 0 0
1 3 2019-10-28 1 1
2 2 2019-10-28 2 2
3 1 2019-10-28 3 3
First two rows:
A B C D
0 4 2019-10-28 0 0
1 3 2019-10-28 1 1
Last five rows:
A B C D
0 4 2019-10-28 0 0
1 3 2019-10-28 1 1
2 2 2019-10-28 2 2
3 1 2019-10-28 3 3
Last two rows:
A B C D
2 2 2019-10-28 2 2
3 1 2019-10-28 3 3
import pandas as pd
import numpy as np
d = {'A':[4,3,2,1],"B":pd.Timestamp('20191028'),'C':range(4),'D':np.arange(4)}
df = pd.DataFrame(d)
print(df)
# print('Transpose of df:\n')
# print(df.T)
print('Sorted by column labels, descending:\n')
print(df.sort_index(axis = 1,ascending=False))
print('Sorted by row labels, ascending:\n')
print(df.sort_index(axis = 0))
print('Sorted by the values in column A, ascending:\n')
print(df.sort_values(by='A'))
A B C D
0 4 2019-10-28 0 0
1 3 2019-10-28 1 1
2 2 2019-10-28 2 2
3 1 2019-10-28 3 3
Sorted by column labels, descending:
D C B A
0 0 0 2019-10-28 4
1 1 1 2019-10-28 3
2 2 2 2019-10-28 2
3 3 3 2019-10-28 1
Sorted by row labels, ascending:
A B C D
0 4 2019-10-28 0 0
1 3 2019-10-28 1 1
2 2 2019-10-28 2 2
3 1 2019-10-28 3 3
Sorted by the values in column A, ascending:
A B C D
3 1 2019-10-28 3 3
2 2 2019-10-28 2 2
1 3 2019-10-28 1 1
0 4 2019-10-28 0 0
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
print(dates)
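# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
# Plain [] with a slice selects rows. An integer slice is positional and,
# like list slicing, excludes the stop position.
# [] is ambiguous between rows and columns, so .loc/.iloc are preferred.
print(data[2:4])
# Recommended:
# .loc selects by index label only,
# and a label slice includes the right endpoint
print(data.loc['2019-10-28':'2019-10-30'])
# .iloc selects by integer position only,
# and excludes the stop position
print(data.iloc[2:4])
# Select rows and columns together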
print(data.loc['2019-10-28':'2019-10-30',['B','C']])
DatetimeIndex(['2019-10-28', '2019-10-29', '2019-10-30', '2019-10-31',
'2019-11-01'],
dtype='datetime64[ns]', freq='D')
A B C D
2019-10-30 1.616449 0.070251 -0.747331 -0.581341
2019-10-31 0.081213 -1.394521 -2.250886 0.776748
A B C D
2019-10-28 1.154322 -0.777393 -0.369332 -0.132886
2019-10-29 0.576568 1.140420 -0.208502 0.270798
2019-10-30 1.616449 0.070251 -0.747331 -0.581341
A B C D
2019-10-30 1.616449 0.070251 -0.747331 -0.581341
2019-10-31 0.081213 -1.394521 -2.250886 0.776748
B C
2019-10-28 -0.777393 -0.369332
2019-10-29 1.140420 -0.208502
2019-10-30 0.070251 -0.747331
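Because the frames above hold random numbers, the endpoint behaviour is easier to check on fixed data. This is a small sketch, assuming a hypothetical frame filled with `np.arange` values, that compares a label slice with a position slice.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20191028', periods=5)
# Deterministic values (0..19) instead of random numbers, for illustration only
frame = pd.DataFrame(np.arange(20).reshape(5, 4), index=dates, columns=list('ABCD'))

by_label = frame.loc['2019-10-28':'2019-10-30']   # both endpoints are included
by_position = frame.iloc[0:2]                     # the stop position is excluded

print(len(by_label))      # 3 rows: the 28th, 29th and 30th
print(len(by_position))   # 2 rows: positions 0 and 1
```
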
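An interesting detail: selecting with a slice returns a DataFrame, even when the slice covers only a single column (the column label is kept), whereas selecting a single column by position returns a Series.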
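The same distinction shows up with label-based column selection. The sketch below assumes a hypothetical two-column frame; a list of labels keeps the DataFrame shape, a single label gives a Series, and `to_frame()` converts back.

```python
import pandas as pd

# Hypothetical frame used only to compare the two return types
frame = pd.DataFrame({'A': [1, 2], 'C': [3, 4]})

as_frame = frame[['C']]    # list of labels: still a DataFrame, column label kept
as_series = frame['C']     # single label: a Series
back_to_frame = as_series.to_frame()

print(type(as_frame), as_frame.shape)
print(type(as_series), as_series.shape)
```
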
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
# print(data)
data1 = data.iloc[:,2:3]
data2 = data.iloc[:,2]
print(type(data1))
print(type(data2))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
print(data)
print('------------------------')
# Note: the row label must be given in the index's own type (a Timestamp here)
print(data.at[pd.Timestamp('20191028'),'B'])
print('------------------------')
print(data.iloc[1,1])
print('------------------------')
print(data.iat[1,1]) # .iat is the faster accessor for a single value
A B C D
2019-10-28 1.136702 -0.329035 0.628148 1.086971
2019-10-29 0.382521 0.555992 -0.526252 2.063961
2019-10-30 -1.557403 1.292362 0.562942 -0.642540
2019-10-31 0.271206 0.344867 -1.777534 0.763646
2019-11-01 1.629705 -1.400197 -1.490891 -1.238162
------------------------
-0.32903469608965175
------------------------
0.5559922517151386
------------------------
0.5559922517151386
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
print(data)
print('------------------------')
data2 = data.copy()
# Add a column
data2['tag'] = [x for x in range(dates.shape[0])]
print(data2)
# Modify a single value
data2.iat[0,0] = 1000
# Modify a whole column
data2.tag = range(5,0,-1)
print(data2)
A B C D
2019-10-28 -1.152138 0.542239 1.551530 0.748325
2019-10-29 -0.790928 1.279881 -1.171393 2.234812
2019-10-30 1.262128 -1.338714 0.039230 0.478960
2019-10-31 1.014206 -1.703972 -1.031489 -0.902610
2019-11-01 1.033968 0.239524 -0.941671 0.375400
------------------------
A B C D tag
2019-10-28 -1.152138 0.542239 1.551530 0.748325 0
2019-10-29 -0.790928 1.279881 -1.171393 2.234812 1
2019-10-30 1.262128 -1.338714 0.039230 0.478960 2
2019-10-31 1.014206 -1.703972 -1.031489 -0.902610 3
2019-11-01 1.033968 0.239524 -0.941671 0.375400 4
A B C D tag
2019-10-28 1000.000000 0.542239 1.551530 0.748325 5
2019-10-29 -0.790928 1.279881 -1.171393 2.234812 4
2019-10-30 1.262128 -1.338714 0.039230 0.478960 3
2019-10-31 1.014206 -1.703972 -1.031489 -0.902610 2
2019-11-01 1.033968 0.239524 -0.941671 0.375400 1
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
data1 = data.reindex(index = dates[0:4],columns = list(data.columns)+['E'])
data1.loc[dates[1:3],'E'] = 2
print(data1)
# Drop rows that contain missing values
data2 = data1.dropna()
print(data2)
# Replace missing values
data3 = data1.fillna(value=0)
print(data3)
A B C D E
2019-10-28 -0.502316 0.011895 0.479873 -0.693274 NaN
2019-10-29 -1.933145 2.588659 -0.542300 -0.858116 2.0
2019-10-30 0.403359 -0.774248 -0.570066 0.732535 2.0
2019-10-31 0.330384 1.453524 1.485526 -0.210194 NaN
A B C D E
2019-10-29 -1.933145 2.588659 -0.542300 -0.858116 2.0
2019-10-30 0.403359 -0.774248 -0.570066 0.732535 2.0
A B C D E
2019-10-28 -0.502316 0.011895 0.479873 -0.693274 0.0
2019-10-29 -1.933145 2.588659 -0.542300 -0.858116 2.0
2019-10-30 0.403359 -0.774248 -0.570066 0.732535 2.0
2019-10-31 0.330384 1.453524 1.485526 -0.210194 0.0
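`dropna` and `fillna` both accept options that narrow what is dropped or how gaps are filled. The following is a minimal sketch on a hypothetical frame; the fill values (0.0 for `A`, the column mean for `E`) are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns
frame = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'E': [np.nan, np.nan, 2.0]})

all_missing_dropped = frame.dropna(how='all')    # drop rows only if every value is missing
a_required = frame.dropna(subset=['A'])          # drop rows where column A is missing
filled = frame.fillna({'A': 0.0, 'E': frame['E'].mean()})  # one fill value per column

print(all_missing_dropped)
print(a_required)
print(filled)
```
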
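The apply function operates on a DataFrame by passing each column (axis=0) or each row (axis=1) to a function; element-wise operations are handled by applying a function to every value of each column.
In pandas 1.0 and later the core signature is DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs); the broadcast and reduce parameters found in older documentation have been removed.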
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
print(data)
data1 = data.apply(np.cumsum)
print(data1)
print('---------------------')
data2 = data.apply(lambda x: x.max() - x.min(),axis = 1)
print(data2)
A B C D
2019-10-28 0.481858 -0.804957 1.646266 -1.822953
2019-10-29 1.791598 1.638538 0.359947 -0.823500
2019-10-30 0.993991 -1.135404 -0.541761 0.726015
2019-10-31 0.711559 -1.269686 0.986044 -0.029288
2019-11-01 0.275556 -1.064297 -0.778964 -0.673782
A B C D
2019-10-28 0.481858 -0.804957 1.646266 -1.822953
2019-10-29 2.273457 0.833581 2.006213 -2.646452
2019-10-30 3.267448 -0.301823 1.464452 -1.920437
2019-10-31 3.979007 -1.571510 2.450496 -1.949725
2019-11-01 4.254563 -2.635806 1.671531 -2.623506
---------------------
2019-10-28 3.469218
2019-10-29 2.615098
2019-10-30 2.129395
2019-10-31 2.255730
2019-11-01 1.339853
Freq: D, dtype: float64
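With fixed numbers the effect of `axis` is easier to verify by hand. This sketch assumes a small hypothetical frame; the element-wise step uses `Series.map` inside `apply`, which is one of several ways to do it and was picked here because it works across pandas versions.

```python
import pandas as pd

# Hypothetical frame with values that are easy to check by hand
frame = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

col_range = frame.apply(lambda col: col.max() - col.min())    # axis=0: one result per column
row_sum = frame.apply(lambda row: row.sum(), axis=1)          # axis=1: one result per row
squared = frame.apply(lambda col: col.map(lambda v: v ** 2))  # every element, column by column

print(col_range)
print(row_sum)
print(squared)
```
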
import numpy as np
import pandas as pd
a = pd.Series(np.random.randint(10,12,size = 6))
print(a)
print('--------------')
# Count how often each value occurs (shown here on a Series; DataFrame gained its own value_counts in pandas 1.1)
print(a.value_counts())
print('--------------')
0 10
1 11
2 11
3 10
4 11
5 11
dtype: int32
--------------
11 4
10 2
dtype: int64
--------------
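`value_counts` can also report proportions instead of raw counts. A short sketch, using fixed values that mirror the printed example above rather than random ones:

```python
import pandas as pd

# Fixed values so the counts are reproducible
s = pd.Series([10, 11, 11, 10, 11, 11])

counts = s.value_counts()                 # sorted by frequency, most common first
shares = s.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```
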
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
# print(data)
# Concatenate: the pieces to join are passed as a list
data1 = pd.concat([data.iloc[:2],data.iloc[2:4],data.iloc[4:5]])
data2 = data1 == data
print(data2)
A B C D
2019-10-28 True True True True
2019-10-29 True True True True
2019-10-30 True True True True
2019-10-31 True True True True
2019-11-01 True True True True
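Appending a Series to a DataFrame as a new row:
A Series, or a single column of another DataFrame, can also be assigned directly to the original df as a column: df["f"] = df2 adds df2 to df as a new column named "f".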
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
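# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
# print(data)
a = pd.Series(np.random.randint(3,5,size = 5), index=list('ABCDE'))
# DataFrame.append was removed in pandas 2.0; turn the Series into a one-row frame and use pd.concat
data = pd.concat([data, a.to_frame().T], ignore_index=True)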
print(data)
A B C D E
0 -0.402279 -0.836888 0.005693 0.344556 NaN
1 0.129264 -2.016967 1.110841 1.889610 NaN
2 0.232075 -0.314341 0.525145 1.134156 NaN
3 1.426809 -0.406842 1.500118 0.057147 NaN
4 0.270634 0.858269 -0.339032 0.004396 NaN
5 3.000000 3.000000 3.000000 4.000000 4.0
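When a Series is assigned as a new column, pandas aligns it on the index rather than on position. A minimal sketch, assuming a hypothetical frame and a Series `extra` whose labels are in a different order and do not cover every row:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
# Hypothetical Series: different label order, and no entry for 'y'
extra = pd.Series([30, 10], index=['z', 'x'])

frame['f'] = extra   # values are matched by label; unmatched rows become NaN
print(frame)
```
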
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
'B': [2, 8, 1, 4, 3, 2, 5, 9],
'C': [102, 98, 107, 104, 115, 87, 92, 123]})
print(df)
# Group by column A and sum
print(df.groupby('A').sum())
# Group by columns A and B and sum
print(df.groupby(['A','B']).sum())
A B C
0 a 2 102
1 b 8 98
2 a 1 107
3 c 4 104
4 a 3 115
5 c 2 87
6 b 5 92
7 c 9 123
B C
A
a 6 324
b 13 190
c 15 314
C
A B
a 1 107
2 102
3 115
b 5 92
8 98
c 2 87
4 104
9 123
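Grouping is not limited to a single aggregation. The sketch below reuses the same data and applies named aggregation through `agg`; the output column names (`B_sum`, `C_mean`, `rows`) are hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
                   'B': [2, 8, 1, 4, 3, 2, 5, 9],
                   'C': [102, 98, 107, 104, 115, 87, 92, 123]})

# One output column per (source column, aggregation) pair
summary = df.groupby('A').agg(B_sum=('B', 'sum'),
                              C_mean=('C', 'mean'),
                              rows=('B', 'count'))
print(summary)
```
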
Multi-level row index
import pandas as pd
import numpy as np
df = pd.DataFrame({'class':['A','A','A','B','B','B','C','C'],
'id':['a','b','c','a','b','c','a','b'],
'value':[1,2,3,4,5,6,7,8]})
# Use class and id as a two-level row index (this sets the index; it does not group)
df.set_index(['class', 'id'],inplace=True)
print(df)
value
class id
A a 1
b 2
c 3
B a 4
b 5
c 6
C a 7
b 8
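Once the two-level index is in place, rows can be selected by the outer level, by a full key, or by the inner level alone. A small sketch on the same data; `set_index` is called without `inplace` here purely as a stylistic choice.

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'id': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8]}).set_index(['class', 'id'])

class_a = df.loc['A']                  # all rows of class A, now indexed by id
single = df.loc[('B', 'c'), 'value']   # one cell addressed by the full (class, id) key
id_a = df.xs('a', level='id')          # rows with id 'a' across all classes

print(class_a)
print(single)
print(id_a)
```
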
Multi-level column index
import pandas as pd
import numpy as np
dfmi = pd.DataFrame([list('abcd'),
list('efgh'),
list('ijkl'),
list('mnop')],
columns=pd.MultiIndex.from_product([['one', 'two'],['first', 'second']]))
print(dfmi)
one two
first second first second
0 a b c d
1 e f g h
2 i j k l
3 m n o p
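Columns with two levels can be selected in a similar way. This sketch rebuilds the same frame and shows three selections: by the outer level, by a full column key, and by the inner level via `xs`.

```python
import pandas as pd

dfmi = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
                    columns=pd.MultiIndex.from_product([['one', 'two'],
                                                        ['first', 'second']]))

one = dfmi['one']                             # outer level: a frame with columns first, second
two_second = dfmi[('two', 'second')]          # full key: a single column as a Series
firsts = dfmi.xs('first', axis=1, level=1)    # inner level: the 'first' column of each group

print(one)
print(two_second)
print(firsts)
```
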
1. Generating and resampling time series:
Randomly generate simulated stock trading data:
import pandas as pd
import numpy as np
# Start at 2019-10-28, 600 timestamps, one second apart
rng = pd.date_range('20191028',periods = 600,freq = 's')
print(rng)
s = pd.Series(np.random.randint(0,500,len(rng)),index = rng)
print(s)
DatetimeIndex(['2019-10-28 00:00:00', '2019-10-28 00:00:01',
'2019-10-28 00:00:02', '2019-10-28 00:00:03',
'2019-10-28 00:00:04', '2019-10-28 00:00:05',
'2019-10-28 00:00:06', '2019-10-28 00:00:07',
'2019-10-28 00:00:08', '2019-10-28 00:00:09',
...
'2019-10-28 00:09:50', '2019-10-28 00:09:51',
'2019-10-28 00:09:52', '2019-10-28 00:09:53',
'2019-10-28 00:09:54', '2019-10-28 00:09:55',
'2019-10-28 00:09:56', '2019-10-28 00:09:57',
'2019-10-28 00:09:58', '2019-10-28 00:09:59'],
dtype='datetime64[ns]', length=600, freq='S')
2019-10-28 00:00:00 183
2019-10-28 00:00:01 373
2019-10-28 00:00:02 368
...
2019-10-28 00:09:54 148
2019-10-28 00:09:55 302
2019-10-28 00:09:56 6
2019-10-28 00:09:57 389
2019-10-28 00:09:58 167
2019-10-28 00:09:59 245
Freq: S, Length: 600, dtype: int32
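# Resample into 2-minute bins and sum each bin
# The older form s.resample('2Min', how='sum') is deprecated; call the aggregation on the resampler instead
s = s.resample('2Min').sum()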
print(s)
2019-10-28 00:00:00 28813
2019-10-28 00:02:00 29213
2019-10-28 00:04:00 27418
2019-10-28 00:06:00 31164
2019-10-28 00:08:00 28702
Freq: 2T, dtype: int32
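A resampler can compute several statistics per bin in one pass. The sketch below replaces the random values with `np.arange(600)` so the results can be checked by hand; that substitution is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('20191028', periods=600, freq='s')
# Deterministic values 0..599 instead of random integers
s = pd.Series(np.arange(600), index=rng)

# Each 2-minute bin covers 120 one-second samples
stats = s.resample('2min').agg(['sum', 'mean', 'max'])
print(stats)
```
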
import pandas as pd
import numpy as np
# Generate a quarterly period range
rng = pd.period_range('2017Q1','2019Q1',freq='Q')
print(rng)
# Convert the quarterly periods to timestamps
s = rng.to_timestamp()
print(s)
PeriodIndex(['2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1', '2018Q2',
'2018Q3', '2018Q4', '2019Q1'],
dtype='period[Q-DEC]', freq='Q-DEC')
DatetimeIndex(['2017-01-01', '2017-04-01', '2017-07-01', '2017-10-01',
'2018-01-01', '2018-04-01', '2018-07-01', '2018-10-01',
'2019-01-01'],
dtype='datetime64[ns]', freq='QS-OCT')
import pandas as pd
import numpy as np
print(pd.Timestamp('20160301') - pd.Timestamp('20160201'))
print(pd.Timestamp('20160301') + pd.Timedelta(days=5))
29 days 00:00:00
2016-03-06 00:00:00
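The difference of two Timestamps is a `Timedelta`, which exposes its parts directly. A short sketch using the same dates, plus a hypothetical offset of 5 days and 12 hours:

```python
import pandas as pd

delta = pd.Timestamp('20160301') - pd.Timestamp('20160201')
print(delta.days)              # 29, because February 2016 has 29 days
print(delta.total_seconds())   # the same span expressed in seconds

# Timedelta accepts several units at once
shifted = pd.Timestamp('20160301') + pd.Timedelta(days=5, hours=12)
print(shifted)
```
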
import pandas as pd
import numpy as np
df = pd.DataFrame({'id':[1,2,3,4,5,6],'raw_grade':['a','b','b','a','a','d']})
# print(df)
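# Add a grade column of category dtype
df['grade'] = df.raw_grade.astype('category')
print(df.grade)
# .cat.categories shows the category labels
print(df["grade"].cat.categories)
# Rename the categories: a becomes very good, b becomes good, d becomes very bad.
# Assigning to .cat.categories was removed in pandas 2.0, so use rename_categories.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])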
print(df)
0 a
1 b
2 b
3 a
4 a
5 d
Name: grade, dtype: category
Categories (3, object): [a, b, d]
Index(['a', 'b', 'd'], dtype='object')
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 d very bad
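Categories can also be reordered and extended, which affects sorting and counting. This is a sketch under the assumption that a five-step scale is wanted; the extra labels `bad` and `medium` are hypothetical and do not appear in the data.

```python
import pandas as pd

grade = pd.Series(['very good', 'good', 'good', 'very good', 'very good', 'very bad'],
                  dtype='category')

# Define the full scale, including labels that no row currently uses
grade = grade.cat.set_categories(['very bad', 'bad', 'medium', 'good', 'very good'])

sorted_grade = grade.sort_values()        # sorts by category order, not alphabetically
counts = grade.value_counts(sort=False)   # unused categories are reported with a count of 0

print(sorted_grade)
print(counts)
```
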
Saving data
import pandas as pd
import numpy as np
dates = pd.date_range('20191028',periods=5)
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
data.to_csv('data.csv')
Reading data
pd.read_csv('data.csv')
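A plain `read_csv` brings the saved index back as an ordinary column. The sketch below uses an in-memory `io.StringIO` buffer instead of a file (an assumption made to keep the example self-contained) and shows how `index_col` and `parse_dates` can restore the date index.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('20191028', periods=3)
data = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=list('AB'))

buffer = io.StringIO()   # stands in for 'data.csv'
data.to_csv(buffer)

buffer.seek(0)
plain = pd.read_csv(buffer)   # the index comes back as a column named 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)   # first column becomes a date index

print(plain.columns.tolist())
print(type(restored.index))
```
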