start: start time
end: end time
periods: number of timestamps to generate
freq: frequency, default 'D'; other options include H(our), W(eek), B(usiness), S(emi-)M(onth), (min)T(es), S(econd), A(year),…
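As a quick illustration of how these four parameters combine, here is a minimal sketch. The variable names are illustrative, and the hourly alias is written in lowercase on the assumption that this spelling is accepted by both older and recent pandas releases (some uppercase aliases have been renamed in newer versions).

```python
import pandas as pd

# start + end: every day between the two dates, both ends included
by_end = pd.date_range(start='2010-01-01', end='2010-01-05')

# start + periods: the same five days, described by a count instead of an end
by_periods = pd.date_range(start='2010-01-01', periods=5)

# start + periods + freq: three timestamps spaced one hour apart
hourly = pd.date_range(start='2010-01-01', periods=3, freq='h')

print(by_end.equals(by_periods))  # True
print(hourly[-1])                 # 2010-01-01 02:00:00
```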
In [23]: import datetime
In [24]: datetime.datetime.strptime('2010-01-01', '%Y-%m-%d') # parse a string into a datetime
Out[24]: datetime.datetime(2010, 1, 1, 0, 0)
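In [25]: # the first argument is the time string, the second is the format
In [26]: import dateutil.parser # a bare "import dateutil" may not load the parser submodule; "from dateutil import parser" also works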
In [27]: dateutil.parser.parse('2001-01-01')
Out[27]: datetime.datetime(2001, 1, 1, 0, 0)
In [28]: # works like datetime's strptime, but no format string is needed
In [29]: dateutil.parser.parse('2001/01/01')
Out[29]: datetime.datetime(2001, 1, 1, 0, 0)
In [30]: dateutil.parser.parse('01/01/2001')
Out[30]: datetime.datetime(2001, 1, 1, 0, 0)
In [31]: dateutil.parser.parse('JAN/01/2001')
Out[31]: datetime.datetime(2001, 1, 1, 0, 0)
In [33]: pd.to_datetime(['2001-01-01', '2010/Feb/02'])
Out[33]: DatetimeIndex(['2001-01-01', '2010-02-02'], dtype='datetime64[ns]', freq=None)
In [34]: # convert date strings written in different formats to datetimes
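The call above mixes two date formats in one list. Depending on the pandas version this may not parse as shown: to my knowledge, pandas 2.0 and later infer a single format from the first element and can reject elements that do not match it. A minimal sketch that sidesteps the question by parsing each string on its own (the names raw and parsed are illustrative):

```python
import pandas as pd

raw = ['2001-01-01', '2010/Feb/02']

# Parse each string separately so every element gets its own format inference,
# then collect the resulting Timestamps into one DatetimeIndex.
parsed = pd.DatetimeIndex([pd.to_datetime(s) for s in raw])

print(parsed)
```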
In [35]: pd.date_range('2010-01-01','2010-01-15')
Out[35]:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
'2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
'2010-01-09', '2010-01-10', '2010-01-11', '2010-01-12',
'2010-01-13', '2010-01-14', '2010-01-15'],
dtype='datetime64[ns]', freq='D')
In [36]: pd.date_range('2010-01-01',periods=20)
Out[36]:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
'2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
'2010-01-09', '2010-01-10', '2010-01-11', '2010-01-12',
'2010-01-13', '2010-01-14', '2010-01-15', '2010-01-16',
'2010-01-17', '2010-01-18', '2010-01-19', '2010-01-20'],
dtype='datetime64[ns]', freq='D')
In [37]: # periods is the number of timestamps to generate
In [38]: pd.date_range('2010-01-01',periods=20, freq='W')
Out[38]:
DatetimeIndex(['2010-01-03', '2010-01-10', '2010-01-17', '2010-01-24',
'2010-01-31', '2010-02-07', '2010-02-14', '2010-02-21',
'2010-02-28', '2010-03-07', '2010-03-14', '2010-03-21',
'2010-03-28', '2010-04-04', '2010-04-11', '2010-04-18',
'2010-04-25', '2010-05-02', '2010-05-09', '2010-05-16'],
dtype='datetime64[ns]', freq='W-SUN')
In [39]: # weekly (anchored on Sunday by default)
In [40]: pd.date_range('2010-01-01',periods=20, freq='W-MON')
Out[40]:
DatetimeIndex(['2010-01-04', '2010-01-11', '2010-01-18', '2010-01-25',
'2010-02-01', '2010-02-08', '2010-02-15', '2010-02-22',
'2010-03-01', '2010-03-08', '2010-03-15', '2010-03-22',
'2010-03-29', '2010-04-05', '2010-04-12', '2010-04-19',
'2010-04-26', '2010-05-03', '2010-05-10', '2010-05-17'],
dtype='datetime64[ns]', freq='W-MON')
In [41]: # every Monday
In [42]: pd.date_range('2010-01-01',periods=20, freq='B')
Out[42]:
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
'2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
'2010-01-13', '2010-01-14', '2010-01-15', '2010-01-18',
'2010-01-19', '2010-01-20', '2010-01-21', '2010-01-22',
'2010-01-25', '2010-01-26', '2010-01-27', '2010-01-28'],
dtype='datetime64[ns]', freq='B')
In [43]: # business days only (Saturdays and Sundays are skipped)
In [45]: dt = _
In [46]: dt[0]
Out[46]: Timestamp('2010-01-01 00:00:00', offset='B')
In [47]: dt[0].to_pydatetime()
Out[47]: datetime.datetime(2010, 1, 1, 0, 0)
In [48]: # convert to a Python datetime
In [49]: pd.date_range('2010-01-01',periods=20, freq='1h20min')
Out[49]:
DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:20:00',
'2010-01-01 02:40:00', '2010-01-01 04:00:00',
'2010-01-01 05:20:00', '2010-01-01 06:40:00',
'2010-01-01 08:00:00', '2010-01-01 09:20:00',
'2010-01-01 10:40:00', '2010-01-01 12:00:00',
'2010-01-01 13:20:00', '2010-01-01 14:40:00',
'2010-01-01 16:00:00', '2010-01-01 17:20:00',
'2010-01-01 18:40:00', '2010-01-01 20:00:00',
'2010-01-01 21:20:00', '2010-01-01 22:40:00',
'2010-01-02 00:00:00', '2010-01-02 01:20:00'],
dtype='datetime64[ns]', freq='80T')
In [50]: # arbitrary intervals
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'one':[1,2,3,4], 'two':[4,5,6,7]}, index=['a','b','c'
...: ,'d'])
In [4]: df
Out[4]:
one two
a 1 4
b 2 5
c 3 6
d 4 7
In [5]: df.mean()
Out[5]:
one 2.5
two 5.5
dtype: float64
In [6]: df.mean(axis=1)
Out[6]:
a 2.5
b 3.5
c 4.5
d 5.5
dtype: float64
In [7]: # mean of each row
In [8]: df.sum()
Out[8]:
one 10
two 22
dtype: int64
In [9]: df.sum(axis=1)
Out[9]:
a 5
b 7
c 9
d 11
dtype: int64
In [10]: # sum of each row
In [11]: df.sort_values(by='one')
Out[11]:
one two
a 1 4
b 2 5
c 3 6
d 4 7
In [12]: df.sort_values(by='one', ascending=False)
Out[12]:
one two
d 4 7
c 3 6
b 2 5
a 1 4
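In [13]: # ascending=False sorts by column one in descending order
In [16]: # when the column contains NaN, the NaN rows are placed last whether the sort is ascending or descending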
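The session does not show the NaN case, so here is a minimal sketch of it. The small DataFrame is made up for illustration; na_position is the sort_values parameter that controls where missing values go, and its default is 'last'.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in column 'one'
df_nan = pd.DataFrame({'one': [3, np.nan, 1, 2]}, index=['a', 'b', 'c', 'd'])

asc = df_nan.sort_values(by='one')
desc = df_nan.sort_values(by='one', ascending=False)
nan_first = df_nan.sort_values(by='one', na_position='first')

print(asc.index.tolist())        # ['c', 'd', 'a', 'b']
print(desc.index.tolist())       # ['a', 'd', 'c', 'b']
print(nan_first.index.tolist())  # ['b', 'c', 'd', 'a']
```

In both default sorts the row with NaN (index b) ends up last; only na_position='first' moves it to the top.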
In [17]: df.sort_index()
Out[17]:
one two
a 1 4
b 2 5
c 3 6
d 4 7
In [18]: # sort by index
In [19]: df.sort_index(ascending=False)
Out[19]:
one two
d 4 7
c 3 6
b 2 5
a 1 4
In [20]: # sort by index in descending order
In [21]: df.sort_index(ascending=False, axis=1)
Out[21]:
two one
a 4 1
b 5 2
c 6 3
d 7 4
In [22]: # sort by column labels
In [23]:
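A time series is a Series or DataFrame indexed by time.
When datetime objects are used as the index, they are stored in a DatetimeIndex object.
Special features of time series:
Slicing by passing a year or a year-month string
Slicing by passing a date range
Supported functions: resample(), truncate(),…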
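Slicing by year is easiest to see on data that crosses a year boundary. The following is a minimal sketch with made-up data; it uses .loc for the partial-string lookup, which I believe is the form that works for both Series and DataFrame.

```python
import pandas as pd

# Six consecutive days spanning the end of 2016 and the start of 2017
ts = pd.Series(range(6), index=pd.date_range('2016-12-30', periods=6))

in_2017 = ts.loc['2017']        # every row whose date falls in 2017
in_dec_2016 = ts.loc['2016-12'] # every row whose date falls in December 2016

print(in_2017.tolist())      # [2, 3, 4, 5]
print(in_dec_2016.tolist())  # [0, 1]
```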
In [2]: import pandas as pd
In [3]: pd.date_range("2010-01-01", '2010-01-20')
Out[3]:
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',
'2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
'2010-01-09', '2010-01-10', '2010-01-11', '2010-01-12',
'2010-01-13', '2010-01-14', '2010-01-15', '2010-01-16',
'2010-01-17', '2010-01-18', '2010-01-19', '2010-01-20'],
dtype='datetime64[ns]', freq='D')
In [4]: import numpy as np
In [5]: sr = pd.Series(np.arange(20),index=pd.date_range('2017-01-01',periods=20
...: ))
In [6]: sr
Out[6]:
2017-01-01 0
2017-01-02 1
2017-01-03 2
2017-01-04 3
2017-01-05 4
2017-01-06 5
2017-01-07 6
2017-01-08 7
2017-01-09 8
2017-01-10 9
2017-01-11 10
2017-01-12 11
2017-01-13 12
2017-01-14 13
2017-01-15 14
2017-01-16 15
2017-01-17 16
2017-01-18 17
2017-01-19 18
2017-01-20 19
Freq: D, dtype: int64
In [7]: sr.index
Out[7]:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12',
'2017-01-13', '2017-01-14', '2017-01-15', '2017-01-16',
'2017-01-17', '2017-01-18', '2017-01-19', '2017-01-20'],
dtype='datetime64[ns]', freq='D')
In [8]: # the index of the sr created here is a DatetimeIndex
In [9]: # so sr is now a time series
In [10]: sr['2017-01'] # select the data for January 2017
Out[10]:
2017-01-01 0
2017-01-02 1
2017-01-03 2
2017-01-04 3
2017-01-05 4
2017-01-06 5
2017-01-07 6
2017-01-08 7
2017-01-09 8
2017-01-10 9
2017-01-11 10
2017-01-12 11
2017-01-13 12
2017-01-14 13
2017-01-15 14
2017-01-16 15
2017-01-17 16
2017-01-18 17
2017-01-19 18
2017-01-20 19
Freq: D, dtype: int64
In [11]: sr['2017-01-01':'2017-01-09']
Out[11]:
2017-01-01 0
2017-01-02 1
2017-01-03 2
2017-01-04 3
2017-01-05 4
2017-01-06 5
2017-01-07 6
2017-01-08 7
2017-01-09 8
Freq: D, dtype: int64
In [12]: sr.resample('W').sum()
Out[12]:
2017-01-01 0
2017-01-08 28
2017-01-15 77
2017-01-22 85
Freq: W-SUN, dtype: int64
In [13]: # weekly totals
In [14]: sr.resample('M').sum()
Out[14]:
2017-01-31 190
Freq: M, dtype: int64
In [15]: # monthly totals
In [16]: sr.resample('M').mean()
Out[16]:
2017-01-31 9.5
Freq: M, dtype: float64
In [17]: sr.truncate(before='2017-01-03') # an after parameter is also available
Out[17]:
2017-01-03 2
2017-01-04 3
2017-01-05 4
2017-01-06 5
2017-01-07 6
2017-01-08 7
2017-01-09 8
2017-01-10 9
2017-01-11 10
2017-01-12 11
2017-01-13 12
2017-01-14 13
2017-01-15 14
2017-01-16 15
2017-01-17 16
2017-01-18 17
2017-01-19 18
2017-01-20 19
Freq: D, dtype: int64
In [18]:
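Common data file format: csv
Reading files with pandas: load data from a file name, a URL, or a file object
read_csv uses a comma "," as the default separator
read_table uses a tab "\t" as the default separator
Main parameters of read_csv and read_table:
sep: the separator; regular expressions such as '\s+' are allowed
header=None: the file has no header row
names: the column names to use
index_col: the column to use as the row index
skiprows: rows to skip; [1, 2, 3] skips the rows numbered 1, 2 and 3 (counting from 0)
na_values: strings to treat as missing values; ['None', 'null'] makes the strings None and null parse as NaN
parse_dates: which columns to parse as dates; a bool or a list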
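To see several of these parameters working together without needing a file on disk, here is a minimal sketch that reads from an in-memory buffer. The sample text, the column names and the 'missing' marker are all made up for illustration.

```python
import io
import pandas as pd

# Whitespace-separated text with a junk first line and no header row
raw = io.StringIO(
    "exported-by-a-hypothetical-tool\n"
    "2017-01-01 1.5 missing\n"
    "2017-01-02   2.5 7\n"
)

prices = pd.read_csv(
    raw,
    sep=r'\s+',                         # one or more whitespace characters
    skiprows=[0],                       # skip line 0, the junk line
    header=None,                        # the data has no header row
    names=['date', 'price', 'volume'],  # supply the column names ourselves
    index_col=0,                        # use the first column as the row index
    na_values=['missing'],              # treat this string as NaN
    parse_dates=True,                   # try to parse the index as dates
)

print(prices)
```

The result should have a DatetimeIndex built from the first column and a NaN in the volume column of the first row.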
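In [19]: # pd.read_csv("filename.csv", index_col='') # index_col names the column to use as the row index
In [20]: # pd.read_csv("filename.csv", index_col=0) # index_col=0 uses the first column as the row index
In [21]: # pd.read_csv("filename.csv", index_col=0, parse_dates=True) # parse_dates=True tries to parse the index as dates
In [22]: # together with index_col=0, this turns a date column in first position into a DatetimeIndex
In [23]: # pd.read_csv("filename.csv", index_col=0, parse_dates=['column', 'column']) # a list of column names makes parse_dates convert those columns to datetimes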
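In [24]: # some csv files have no header row; by default the first data row is then read as the column names, which loses that row
In [25]: # pd.read_csv('filename.csv', header=None) # with header=None, pandas generates the column names 0, 1, 2, 3, ... so no data row is lost
In [26]: # pd.read_csv('filename.csv', header=None, names=['name1', 'name2']) # with header=None and names, pandas uses the supplied names list as the column names
In [27]: # read_table differs little from read_csv; its default separator is a tab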
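The effect of header=None is easy to check on a two-line sample. A minimal sketch, using an in-memory string instead of a file; the sample text and the names x and y are made up.

```python
import io
import pandas as pd

text = "1,2\n3,4\n"

# Default: the first line is consumed as column names, leaving one data row
default = pd.read_csv(io.StringIO(text))

# header=None: both lines are data, columns are numbered 0, 1
no_header = pd.read_csv(io.StringIO(text), header=None)

# header=None with names: both lines are data, columns take the given names
named = pd.read_csv(io.StringIO(text), header=None, names=['x', 'y'])

print(default.columns.tolist(), len(default))      # ['1', '2'] 1
print(no_header.columns.tolist(), len(no_header))  # [0, 1] 2
print(named.columns.tolist(), len(named))          # ['x', 'y'] 2
```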
In [28]:
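Writing to a csv file: the to_csv function
Main parameters:
sep: the separator used in the file
na_rep: the string written for missing values; defaults to the empty string
header=False: do not write the row of column names
index=False: do not write the column of row labels
columns: the columns to write, passed as a list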
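These parameters can be tried without touching the disk, because to_csv returns the text as a string when no path is given. A minimal sketch with made-up data and an arbitrary 'NULL' marker:

```python
import numpy as np
import pandas as pd

out_df = pd.DataFrame({'one': [1.0, np.nan], 'two': [4, 5]}, index=['a', 'b'])

csv_text = out_df.to_csv(
    sep=';',                 # semicolon instead of comma
    na_rep='NULL',           # write missing values as NULL
    header=False,            # no row of column names
    index=False,             # no column of row labels
    columns=['two', 'one'],  # choose and order the columns to write
)

print(csv_text.splitlines())  # ['4;1.0', '5;NULL']
```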
pandas also supports other file types:
json, xml, html, databases, pickle, excel…
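As one example of these other formats, here is a minimal sketch of a JSON round trip through an in-memory string. The data is made up, and the read goes through io.StringIO on the assumption that recent pandas versions prefer a file-like object over a raw JSON string.

```python
import io
import pandas as pd

json_df = pd.DataFrame({'one': [1, 2], 'two': [4, 5]}, index=['a', 'b'])

# Serialize to a JSON string, then read it back into a DataFrame
json_text = json_df.to_json()
restored = pd.read_json(io.StringIO(json_text))

print(json_text)
print(restored['one'].tolist())  # [1, 2]
```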