Python Data Science Packages (Part 2): pandas Quick Start

1. Series

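A Series is a one-dimensional, array-like object made up of a sequence of values (of any NumPy data type) and an associated sequence of data labels, called the index.
When printed, a Series shows the index on the left and the values on the right.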
import pandas as pd
import numpy as np
 
s = pd.Series([1,3,4,np.nan,8,4])  # np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
print(s)
0    1.0
1    3.0
2    4.0
3    NaN
4    8.0
5    4.0
dtype: float64
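The index does not have to be the default 0, 1, 2, ... sequence. As a minimal sketch (the labels and values below are made up for illustration), a Series can be built with explicit labels and then queried by label:

```python
import pandas as pd

# Hypothetical data: labels and values are invented for this example.
prices = pd.Series([3.5, 1.2, 4.8], index=["apple", "banana", "cherry"])

print(prices.index.tolist())      # the labels (the index)
print(prices.values)              # the underlying NumPy array
banana_price = prices["banana"]   # label-based lookup
print(banana_price)
```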

2. DataFrame

2.1 Creating a DataFrame

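A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that all share the same index.

Create a DataFrame by specifying the data, the index, and the columns: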
import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
print(dates)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
print(data)
print(data.shape)
print(data.values) # the values as a 2-D NumPy array
DatetimeIndex(['2019-10-28', '2019-10-29', '2019-10-30', '2019-10-31',
               '2019-11-01'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2019-10-28  0.787455 -1.041718  0.577362  0.889686
2019-10-29 -0.017887  0.630519 -0.550084  0.373894
2019-10-30  1.969060 -1.010220 -0.166625 -1.247600
2019-10-31 -2.103929 -0.359340 -1.139270  1.283892
2019-11-01 -2.059919  0.690421  0.228301 -0.961284
(5, 4)
[[ 0.7874552  -1.04171802  0.5773621   0.88968563]
 [-0.01788734  0.63051917 -0.5500839   0.37389389]
 [ 1.9690596  -1.01022027 -0.16662497 -1.2476003 ]
 [-2.10392864 -0.35933967 -1.13927039  1.28389242]
 [-2.05991921  0.69042131  0.22830095 -0.961284  ]]
Create a DataFrame from a dictionary:
import pandas as pd
import numpy as np
 
d = {'A':[4,3,2,1],"B":pd.Timestamp('20191028'),'C':range(4),'D':np.arange(4)}
print(d)
df = pd.DataFrame(d)
print(df)
print('[df.dtypes]:\n',df.dtypes) # data type of each column
print('[df.A]:\n',df.A) # index and values of column A
print('[type(df.A)]:\n',type(df.A)) # column A is a Series
{'A': [4, 3, 2, 1], 'B': Timestamp('2019-10-28 00:00:00'), 'C': range(0, 4), 'D': array([0, 1, 2, 3])}
   A          B  C  D
0  4 2019-10-28  0  0
1  3 2019-10-28  1  1
2  2 2019-10-28  2  2
3  1 2019-10-28  3  3
[df.dtypes]:
 A             int64
B    datetime64[ns]
C             int64
D             int32
dtype: object
[df.A]:
 0    4
1    3
2    2
3    1
Name: A, dtype: int64
[type(df.A)]:
 <class 'pandas.core.series.Series'>

2.2 Viewing data

import pandas as pd
import numpy as np
 
d = {'A':[4,3,2,1],"B":pd.Timestamp('20191028'),'C':range(4),'D':np.arange(4)}
df = pd.DataFrame(d)
print('Summary statistics:\n')
print(df.describe())
print('First five rows:\n')
print(df.head())
print('First two rows:\n')
print(df.head(2))
print('Last five rows:\n')
print(df.tail())
print('Last two rows:\n')
print(df.tail(2))
Summary statistics:
 
              A         C         D
count  4.000000  4.000000  4.000000
mean   2.500000  1.500000  1.500000
std    1.290994  1.290994  1.290994
min    1.000000  0.000000  0.000000
25%    1.750000  0.750000  0.750000
50%    2.500000  1.500000  1.500000
75%    3.250000  2.250000  2.250000
max    4.000000  3.000000  3.000000
First five rows:
 
   A          B  C  D
0  4 2019-10-28  0  0
1  3 2019-10-28  1  1
2  2 2019-10-28  2  2
3  1 2019-10-28  3  3
First two rows:
 
   A          B  C  D
0  4 2019-10-28  0  0
1  3 2019-10-28  1  1
Last five rows:
 
   A          B  C  D
0  4 2019-10-28  0  0
1  3 2019-10-28  1  1
2  2 2019-10-28  2  2
3  1 2019-10-28  3  3
Last two rows:
 
   A          B  C  D
2  2 2019-10-28  2  2
3  1 2019-10-28  3  3

2.3 Sorting

import pandas as pd
import numpy as np
 
d = {'A':[4,3,2,1],"B":pd.Timestamp('20191028'),'C':range(4),'D':np.arange(4)}
df = pd.DataFrame(d)
print(df)
# print('Transpose of df:\n')
# print(df.T)
print('Sorted by column label, descending:\n')
print(df.sort_index(axis = 1,ascending=False))
print('Sorted by row label, ascending:\n')
print(df.sort_index(axis = 0))
print('Sorted by the values in column A, ascending:\n')
print(df.sort_values(by='A'))
   A          B  C  D
0  4 2019-10-28  0  0
1  3 2019-10-28  1  1
2  2 2019-10-28  2  2
3  1 2019-10-28  3  3
Sorted by column label, descending:
 
   D  C          B  A
0  0  0 2019-10-28  4
1  1  1 2019-10-28  3
2  2  2 2019-10-28  2
3  3  3 2019-10-28  1
Sorted by row label, ascending:
 
   A          B  C  D
0  4 2019-10-28  0  0
1  3 2019-10-28  1  1
2  2 2019-10-28  2  2
3  1 2019-10-28  3  3
Sorted by the values in column A, ascending:
 
   A          B  C  D
3  1 2019-10-28  3  3
2  2 2019-10-28  2  2
1  3 2019-10-28  1  1
0  4 2019-10-28  0  0
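sort_values also accepts several columns at once. The sketch below, which uses a small made-up frame rather than the one above, sorts by A ascending and breaks ties by B descending:

```python
import pandas as pd

# Hypothetical frame chosen so that column A contains ties.
df = pd.DataFrame({"A": [2, 1, 2, 1], "B": [1, 2, 3, 4]})

# Sort by A ascending; within equal A values, sort by B descending.
ordered = df.sort_values(by=["A", "B"], ascending=[True, False])
print(ordered)
```

The original row labels travel with the rows, so the index of the result records where each row came from.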

2.4 Slicing

import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
print(dates)
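# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
 
# Plain [] with a slice selects rows, either by position (as here) or by label
# As with list slicing, the right endpoint of a positional slice is excluded
# This form is less efficient and not recommended
print(data[2:4])
 
# Recommended
# .loc selects by index label
# and the right endpoint is included
print(data.loc['2019-10-28':'2019-10-30'])
 
# .iloc selects by integer position only
# and the right endpoint is excluded
print(data.iloc[2:4])
 
# Select specific rows and columns
print(data.loc['2019-10-28':'2019-10-30',['B','C']])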
DatetimeIndex(['2019-10-28', '2019-10-29', '2019-10-30', '2019-10-31',
               '2019-11-01'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2019-10-30  1.616449  0.070251 -0.747331 -0.581341
2019-10-31  0.081213 -1.394521 -2.250886  0.776748
                   A         B         C         D
2019-10-28  1.154322 -0.777393 -0.369332 -0.132886
2019-10-29  0.576568  1.140420 -0.208502  0.270798
2019-10-30  1.616449  0.070251 -0.747331 -0.581341
                   A         B         C         D
2019-10-30  1.616449  0.070251 -0.747331 -0.581341
2019-10-31  0.081213 -1.394521 -2.250886  0.776748
                   B         C
2019-10-28 -0.777393 -0.369332
2019-10-29  1.140420 -0.208502
2019-10-30  0.070251 -0.747331
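The endpoint rules are easier to check with fixed data. This sketch replaces the random values with np.arange (an assumption made only so the result is reproducible) and compares a label slice with a positional slice:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20191028", periods=5)
# Deterministic stand-in for the random data used above.
data = pd.DataFrame(np.arange(20).reshape(5, 4), index=dates, columns=list("ABCD"))

by_label = data.loc["2019-10-28":"2019-10-30"]   # label slice: both endpoints included
by_position = data.iloc[0:2]                     # positional slice: right endpoint excluded

print(len(by_label), len(by_position))
```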
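An interesting detail: slicing always returns a DataFrame (even for a single column, whose column label is kept), whereas selecting with a single position or label returns a Series.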
import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
# print(data)
 
data1 = data.iloc[:,2:3]
data2 = data.iloc[:,2]
print(type(data1))
print(type(data2))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
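The same distinction holds for label-based selection with .loc. In this sketch (fixed np.arange data is assumed for reproducibility), a list of labels yields a DataFrame and a single label yields a Series:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20191028", periods=5)
data = pd.DataFrame(np.arange(20).reshape(5, 4), index=dates, columns=list("ABCD"))

as_frame = data.loc[:, ["C"]]   # list of labels -> DataFrame with one column
as_series = data.loc[:, "C"]    # single label   -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```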

2.5 Accessing a single value

import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
print(data)
print('------------------------')
# Note: the row label must be given in its original type (here a Timestamp)
print(data.at[pd.Timestamp('20191028'),'B'])
print('------------------------')
print(data.iloc[1,1])
print('------------------------')
print(data.iat[1,1]) # somewhat faster for a single value
                  A         B         C         D
2019-10-28  1.136702 -0.329035  0.628148  1.086971
2019-10-29  0.382521  0.555992 -0.526252  2.063961
2019-10-30 -1.557403  1.292362  0.562942 -0.642540
2019-10-31  0.271206  0.344867 -1.777534  0.763646
2019-11-01  1.629705 -1.400197 -1.490891 -1.238162
------------------------
-0.32903469608965175
------------------------
0.5559922517151386
------------------------
0.5559922517151386

2.6 Modifying data

import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
print(data)
print('------------------------')
data2 = data.copy()
# Add a column
data2['tag'] = range(len(dates))
print(data2)
 
# Modify a single value
data2.iat[0,0] = 1000
 
# Replace a whole column
data2['tag'] = range(5,0,-1)
print(data2)
                   A         B         C         D
2019-10-28 -1.152138  0.542239  1.551530  0.748325
2019-10-29 -0.790928  1.279881 -1.171393  2.234812
2019-10-30  1.262128 -1.338714  0.039230  0.478960
2019-10-31  1.014206 -1.703972 -1.031489 -0.902610
2019-11-01  1.033968  0.239524 -0.941671  0.375400
------------------------
                   A         B         C         D  tag
2019-10-28 -1.152138  0.542239  1.551530  0.748325    0
2019-10-29 -0.790928  1.279881 -1.171393  2.234812    1
2019-10-30  1.262128 -1.338714  0.039230  0.478960    2
2019-10-31  1.014206 -1.703972 -1.031489 -0.902610    3
2019-11-01  1.033968  0.239524 -0.941671  0.375400    4
                      A         B         C         D  tag
2019-10-28  1000.000000  0.542239  1.551530  0.748325    5
2019-10-29    -0.790928  1.279881 -1.171393  2.234812    4
2019-10-30     1.262128 -1.338714  0.039230  0.478960    3
2019-10-31     1.014206 -1.703972 -1.031489 -0.902610    2
2019-11-01     1.033968  0.239524 -0.941671  0.375400    1

2.7 Handling missing values

import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
 
data1 = data.reindex(index = dates[0:4],columns = list(data.columns)+['E'])
data1.loc[dates[1:3],'E'] = 2
print(data1)
 
# Drop rows that contain missing values
data2 = data1.dropna()
print(data2)
 
# Replace missing values
data3 = data1.fillna(value=0)
print(data3)
                   A         B         C         D    E
2019-10-28 -0.502316  0.011895  0.479873 -0.693274  NaN
2019-10-29 -1.933145  2.588659 -0.542300 -0.858116  2.0
2019-10-30  0.403359 -0.774248 -0.570066  0.732535  2.0
2019-10-31  0.330384  1.453524  1.485526 -0.210194  NaN
                   A         B         C         D    E
2019-10-29 -1.933145  2.588659 -0.542300 -0.858116  2.0
2019-10-30  0.403359 -0.774248 -0.570066  0.732535  2.0
                   A         B         C         D    E
2019-10-28 -0.502316  0.011895  0.479873 -0.693274  0.0
2019-10-29 -1.933145  2.588659 -0.542300 -0.858116  2.0
2019-10-30  0.403359 -0.774248 -0.570066  0.732535  2.0
2019-10-31  0.330384  1.453524  1.485526 -0.210194  0.0
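Before dropping or filling, it often helps to see how many values are missing. A minimal sketch on a small made-up frame: isna().sum() counts the missing values per column, and fillna can take per-column fill values, here the column means:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]})

missing_per_column = df.isna().sum()   # number of NaN values in each column
filled = df.fillna(df.mean())          # replace each NaN with its column's mean

print(missing_per_column)
print(filled)
```

Filling with the mean is only one possible choice; whether it is appropriate depends on the data.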

2.8 The apply function

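The apply function applies a function along an axis of a DataFrame: the function receives each column (axis=0) or each row (axis=1) as a Series. For purely element-wise operations, recent pandas versions provide DataFrame.map (formerly applymap).
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)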
import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
print(data)
 
data1 = data.apply(np.cumsum)
print(data1)
print('---------------------')
data2 = data.apply(lambda x: x.max() - x.min(),axis = 1)
print(data2)
                   A         B         C         D
2019-10-28  0.481858 -0.804957  1.646266 -1.822953
2019-10-29  1.791598  1.638538  0.359947 -0.823500
2019-10-30  0.993991 -1.135404 -0.541761  0.726015
2019-10-31  0.711559 -1.269686  0.986044 -0.029288
2019-11-01  0.275556 -1.064297 -0.778964 -0.673782
                   A         B         C         D
2019-10-28  0.481858 -0.804957  1.646266 -1.822953
2019-10-29  2.273457  0.833581  2.006213 -2.646452
2019-10-30  3.267448 -0.301823  1.464452 -1.920437
2019-10-31  3.979007 -1.571510  2.450496 -1.949725
2019-11-01  4.254563 -2.635806  1.671531 -2.623506
---------------------
2019-10-28    3.469218
2019-10-29    2.615098
2019-10-30    2.129395
2019-10-31    2.255730
2019-11-01    1.339853
Freq: D, dtype: float64
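Because the data above is random, the effect of the axis argument is easier to verify on fixed numbers. This sketch uses a small made-up frame:

```python
import pandas as pd

# Hypothetical frame with easy-to-check values.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# axis=0 (the default): the function receives each column, one result per column.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row, one result per row.
row_sum = df.apply(lambda row: row.sum(), axis=1)

print(col_range)
print(row_sum)
```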

2.9 Counting value frequencies

import numpy as np
import pandas as pd
 
a = pd.Series(np.random.randint(10,12,size = 6))
print(a)
print('--------------')
# Count how many times each value occurs (shown here on a Series; DataFrame has its own value_counts since pandas 1.1)
print(a.value_counts())
print('--------------')
0    10
1    11
2    11
3    10
4    11
5    11
dtype: int32
--------------
11    4
10    2
dtype: int64
--------------
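value_counts can also report relative frequencies. A short sketch on made-up data, using the normalize parameter:

```python
import pandas as pd

s = pd.Series(["x", "y", "x", "x"])   # hypothetical values

counts = s.value_counts()                 # absolute counts, most frequent first
shares = s.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```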

2.10 Concatenation

import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
# print(data)
 
# Concatenate the pieces, which are passed together in a list
data1 = pd.concat([data.iloc[:2],data.iloc[2:4],data.iloc[4:5]])
data2 = data1 == data
print(data2)
               A     B     C     D
2019-10-28  True  True  True  True
2019-10-29  True  True  True  True
2019-10-30  True  True  True  True
2019-10-31  True  True  True  True
2019-11-01  True  True  True  True
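Appending a Series to a DataFrame as a new row:
A Series, or a column of another DataFrame, can also be assigned directly as a new column of the original DataFrame: df["f"] = df2 adds df2 to df as a column named "f".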
import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
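# Create a DataFrame by specifying the data, index, and columns
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
# print(data)
a = pd.Series(np.random.randint(3,5,size = 5), index=list('ABCDE'))
 
# DataFrame.append was removed in pandas 2.0, so turn the Series into a one-row frame and use pd.concat
data = pd.concat([data, a.to_frame().T], ignore_index=True)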
print(data)
          A         B         C         D    E
0 -0.402279 -0.836888  0.005693  0.344556  NaN
1  0.129264 -2.016967  1.110841  1.889610  NaN
2  0.232075 -0.314341  0.525145  1.134156  NaN
3  1.426809 -0.406842  1.500118  0.057147  NaN
4  0.270634  0.858269 -0.339032  0.004396  NaN
5  3.000000  3.000000  3.000000  4.000000  4.0
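The column assignment df["f"] = ... mentioned above is not shown in the code, so here is a minimal sketch with made-up labels. Note that pandas aligns the Series to the frame by index label, not by position, and labels missing from the Series become NaN:

```python
import pandas as pd

# Hypothetical frame and Series; the row labels r1..r3 are invented.
df = pd.DataFrame({"A": [1, 2, 3]}, index=["r1", "r2", "r3"])
extra = pd.Series([30, 10], index=["r3", "r1"])   # different order, and r2 is missing

df["f"] = extra   # aligned on the index: r1 -> 10, r2 -> NaN, r3 -> 30
print(df)
```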

2.11 Grouping

import pandas as pd
import numpy as np
 
df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
                  'B': [2, 8, 1, 4, 3, 2, 5, 9],
                  'C': [102, 98, 107, 104, 115, 87, 92, 123]})
print(df)
 
# Group by column A and sum each group
print(df.groupby('A').sum())
# Group by columns A and B and sum each group
print(df.groupby(['A','B']).sum())
   A  B    C
0  a  2  102
1  b  8   98
2  a  1  107
3  c  4  104
4  a  3  115
5  c  2   87
6  b  5   92
7  c  9  123
    B    C
A         
a   6  324
b  13  190
c  15  314
       C
A B     
a 1  107
  2  102 
  3  115
b 5   92
  8   98
c 2   87
  4  104
  9  123
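A group does not have to be reduced with a single function. As a sketch on the same data, named aggregation in agg computes a different statistic per column; the output column names B_sum and C_mean are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "c", "a", "c", "b", "c"],
                   "B": [2, 8, 1, 4, 3, 2, 5, 9],
                   "C": [102, 98, 107, 104, 115, 87, 92, 123]})

# One row per group in A: the sum of B and the mean of C.
summary = df.groupby("A").agg(B_sum=("B", "sum"), C_mean=("C", "mean"))
print(summary)
```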

2.12 MultiIndex (hierarchical indexing)

Multi-level row index
import pandas as pd
import numpy as np
 
 
df = pd.DataFrame({'class':['A','A','A','B','B','B','C','C'],
                    'id':['a','b','c','a','b','c','a','b'],
                    'value':[1,2,3,4,5,6,7,8]})
# Use class and id together as a two-level row index
df.set_index(['class', 'id'],inplace=True)
print(df)
          value
class id       
A     a       1
      b       2
      c       3
B     a       4
      b       5
      c       6
C     a       7
      b       8
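Once the two-level index is in place, rows can be selected at either level. A short sketch on the same data, using .loc for the outer level or a full key, and xs for the inner level:

```python
import pandas as pd

df = pd.DataFrame({"class": ["A", "A", "A", "B", "B", "B", "C", "C"],
                   "id": ["a", "b", "c", "a", "b", "c", "a", "b"],
                   "value": [1, 2, 3, 4, 5, 6, 7, 8]}).set_index(["class", "id"])

class_b = df.loc["B"]                  # all rows of class B, indexed by id
single = df.loc[("B", "c"), "value"]   # one cell, addressed by the full (class, id) key
id_a = df.xs("a", level="id")          # rows with id 'a' from every class

print(class_b)
print(single)
print(id_a)
```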
Multi-level column index
import pandas as pd
import numpy as np
 
dfmi = pd.DataFrame([list('abcd'),
                    list('efgh'),
                    list('ijkl'),
                    list('mnop')],
columns=pd.MultiIndex.from_product([['one', 'two'],['first', 'second']]))
print(dfmi)
    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p

2.13 Time series

1. Generating and resampling times:
Generate random data that mimics stock trading records:

import pandas as pd
import numpy as np
 
# Start at 2019-10-28, 600 time points, one second apart
rng = pd.date_range('20191028',periods = 600,freq = 's')
print(rng)
 
s = pd.Series(np.random.randint(0,500,len(rng)),index = rng)
print(s)
DatetimeIndex(['2019-10-28 00:00:00', '2019-10-28 00:00:01',
               '2019-10-28 00:00:02', '2019-10-28 00:00:03',
               '2019-10-28 00:00:04', '2019-10-28 00:00:05',
               '2019-10-28 00:00:06', '2019-10-28 00:00:07',
               '2019-10-28 00:00:08', '2019-10-28 00:00:09',
               ...
               '2019-10-28 00:09:50', '2019-10-28 00:09:51',
               '2019-10-28 00:09:52', '2019-10-28 00:09:53',
               '2019-10-28 00:09:54', '2019-10-28 00:09:55',
               '2019-10-28 00:09:56', '2019-10-28 00:09:57',
               '2019-10-28 00:09:58', '2019-10-28 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
2019-10-28 00:00:00    183
2019-10-28 00:00:01    373
2019-10-28 00:00:02    368
 
 
                      ... 
 
 
2019-10-28 00:09:54    148
2019-10-28 00:09:55    302
2019-10-28 00:09:56      6
2019-10-28 00:09:57    389
2019-10-28 00:09:58    167
2019-10-28 00:09:59    245
Freq: S, Length: 600, dtype: int32
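# Resample into 2-minute bins and sum each bin
# The old form s.resample('2Min', how='mean') is no longer supported; call the aggregation on the resampler instead
s = s.resample('2min').sum()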
print(s)
2019-10-28 00:00:00    28813
2019-10-28 00:02:00    29213
2019-10-28 00:04:00    27418
2019-10-28 00:06:00    31164
2019-10-28 00:08:00    28702
Freq: 2T, dtype: int32
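With random values the sums above cannot be checked by hand. This sketch repeats the resampling on a deterministic series (np.arange over 240 seconds, an assumption made for reproducibility), so each 2-minute bin holds exactly 120 samples:

```python
import numpy as np
import pandas as pd

# 240 one-second samples with values 0, 1, ..., 239.
rng = pd.date_range("20191028", periods=240, freq="s")
s = pd.Series(np.arange(240), index=rng)

bins = s.resample("2min")
sums = bins.sum()     # 0+...+119 and 120+...+239
sizes = bins.size()   # number of samples in each bin

print(sums)
print(sizes)
```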
import pandas as pd
import numpy as np
 
# Generate quarterly periods
rng = pd.period_range('2017Q1','2019Q1',freq='Q')
print(rng)
 
# Convert the quarterly periods to timestamps
s = rng.to_timestamp()
print(s)
PeriodIndex(['2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1', '2018Q2',
             '2018Q3', '2018Q4', '2019Q1'],
            dtype='period[Q-DEC]', freq='Q-DEC')
DatetimeIndex(['2017-01-01', '2017-04-01', '2017-07-01', '2017-10-01',
               '2018-01-01', '2018-04-01', '2018-07-01', '2018-10-01',
               '2019-01-01'],
              dtype='datetime64[ns]', freq='QS-OCT')
2. Date and time arithmetic:
import pandas as pd
import numpy as np
 
print(pd.Timestamp('20160301') - pd.Timestamp('20160201'))
 
print(pd.Timestamp('20160301') + pd.Timedelta(days=5))
29 days 00:00:00
2016-03-06 00:00:00

2.14 Categorical data

import pandas as pd
import numpy as np
 
df = pd.DataFrame({'id':[1,2,3,4,5,6],'raw_grade':['a','b','b','a','a','d']})
# print(df)
 
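# Add a grade column with the category dtype
df['grade'] = df.raw_grade.astype('category')
print(df.grade)
 
# .cat.categories shows the category labels
print(df["grade"].cat.categories)
# Rename the categories: a becomes very good, b becomes good, d becomes very bad.
# Assigning to .cat.categories directly no longer works in pandas 2.0+, so use rename_categories.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])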
print(df)
0    a
1    b
2    b
3    a
4    a
5    d
Name: grade, dtype: category
Categories (3, object): [a, b, d]
Index(['a', 'b', 'd'], dtype='object')
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         d   very bad
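Renaming by position depends on the order of the existing categories. As an alternative sketch, rename_categories also accepts a dictionary, which makes the old-to-new mapping explicit (the new label names follow the example above):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "d"]})
df["grade"] = df["raw_grade"].astype("category")

# Explicit old -> new mapping instead of a positional list.
mapping = {"a": "very good", "b": "good", "d": "very bad"}
df["grade"] = df["grade"].cat.rename_categories(mapping)

print(df["grade"].tolist())
print(list(df["grade"].cat.categories))
```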

2.15 Saving and loading data

Saving data

import pandas as pd
import numpy as np
 
dates = pd.date_range('20191028',periods=5)
data = pd.DataFrame(np.random.randn(5,4),index=dates,columns = list('ABCD'))
 
# Write the frame, including its index, to a CSV file
data.to_csv('data.csv')

Loading data

pd.read_csv('data.csv')
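Read back with the default arguments, the saved index comes back as an ordinary unnamed column of strings. The sketch below shows one way to restore the date index, using index_col and parse_dates. It writes to an in-memory buffer instead of data.csv and uses fixed values, both assumptions made so the example leaves no file behind and is reproducible:

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range("20191028", periods=3)
data = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=list("AB"))

buffer = io.StringIO()   # stands in for a file on disk
data.to_csv(buffer)
buffer.seek(0)

# Use the first column as the index and parse it as dates.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored)
print(type(restored.index).__name__)
```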
