Datetime handling for meteorological data: a collection [Updating]

Preface

Using Pandas

import pandas as pd
import numpy as np
import datetime

I have only just started learning Python. These are notes on problems I ran into during my own research.

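Setting the date as the index: pd.to_datetime

Meteorological work often requires annual, seasonal, or monthly means, and station data may also be hourly, so using the date as the index makes later processing easier.

Common usage

  1. Assembled from several DataFrame columns. 'year', 'month', and 'day' are required; 'hour', 'minute', 'second', 'ms', 'us', and 'ns' are optional, and plural forms of these names are also accepted.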
df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})
pd.to_datetime(df)
Out[9]: 
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]
  2. Converted from a string. The format can be given explicitly, for example format='%Y%m%d'.
pd.to_datetime('197901010600')
Out[2]: Timestamp('1979-01-01 06:00:00')
pd.to_datetime('1979-01-01 00') # put a space between the date and the time
Out[7]: Timestamp('1979-01-01 00:00:00')
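
The string case also works on a whole column at once. A minimal sketch, assuming compact 'yyyymmddHHMM' strings like the one above; the sample values and the name raw are made up for illustration.

```python
import pandas as pd

# Hypothetical column of compact timestamps (yyyymmddHHMM)
raw = pd.Series(['197901010000', '197901010600', '197901011200'])

# An explicit format avoids guessing and fails loudly on malformed input
times = pd.to_datetime(raw, format='%Y%m%d%H%M')
print(times)
```

Giving the format explicitly is usually also faster than letting pandas infer it for every value.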

Note that a Timestamp produced by to_datetime is limited to the following range:

In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')

In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')

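A date outside this range therefore cannot become a Timestamp. With errors='ignore' (deprecated in recent pandas releases), the result comes back as a datetime.datetime instead: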

In [5]: pd.to_datetime('13000101',format='%Y%m%d', errors='ignore')
Out[5]: datetime.datetime(1300, 1, 1, 0, 0)

Use errors='coerce' to set it to NaT instead:

In [6]: pd.to_datetime('13000101',format='%Y%m%d', errors='coerce')
Out[6]: NaT
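
The same option is handy for columns that contain placeholder values. A minimal sketch, assuming a date column in which 99999999 marks a missing report (the values and the name raw are made up):

```python
import pandas as pd

# Hypothetical date column where 99999999 marks a missing report
raw = pd.Series(['19790101', '99999999', '19790102'])

# Unparseable entries become NaT instead of raising an error
dates = pd.to_datetime(raw, format='%Y%m%d', errors='coerce')
n_bad = int(dates.isna().sum())
print(dates)
print('unparseable entries:', n_bad)
```

Counting the NaT values afterwards shows how many rows were silently dropped from the time axis.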

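Practical examples

Example 1

The data are ISD station data; the original file is a comma-separated CSV. Part of the data, previewed in Excel:

ISD station data

The data already have column names and the date is in a single column, so the file can be read directly with pd.read_csv: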

df = pd.read_csv(file,  dtype={'STATION':str})        
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.set_index('DATE')       # set date as index

Example 2

If the date comes in the more common whitespace-separated form shown below, pd.to_datetime can convert it as well.
The columns are, in order: station ID, station type, year, month, day, standard report hour, actual report time...

NCEP ADP decoded data

colNames = ['id', 'fmtflag', 'year', 'month', 'day', 'hour', 'time', 'lat', 'lon', 'elev',
            'ww', 'pw', 'slp', 'stp', 't', 'td']  # ww/pw: present/past weather; slp/stp: sea level/station pressure
df = pd.read_table(file, header=None, names=colNames,
                   sep=r'\s+', dtype={'id': str})  # read_csv also works
df['date'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
df = df.set_index('date')  # set date as index
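
To try this pattern without the original file, here is a self-contained sketch that reads two made-up whitespace-separated records from a string. The reduced column list and the name df_demo are assumptions for illustration, not the real NCEP ADP layout.

```python
import io
import pandas as pd

# Two made-up records: id, year, month, day, hour, temperature
text = """\
01001 1979 1 1 6 -3.0
01001 1979 1 1 12 -2.5
"""
cols = ['id', 'year', 'month', 'day', 'hour', 't']
df_demo = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None,
                      names=cols, dtype={'id': str})

# Assemble the datetime from its component columns and use it as the index
df_demo['date'] = pd.to_datetime(df_demo[['year', 'month', 'day', 'hour']])
df_demo = df_demo.set_index('date')
print(df_demo)
```

Reading the id as str keeps the leading zero of the station number.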

Generating date sequences

pd.date_range

months = pd.date_range('1973-01', '2020-01', freq='M')

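To display these in 'yyyymm' format (useful when reading data, since some data paths look like /200701/...):

months = pd.date_range('1973-01', '2020-01', freq='M').strftime('%Y%m')

strftime converts the dates to strings. Note that %m is the month, while %M is the minute.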
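
A short sketch of how such strings might be used to build paths. The /data/ prefix is a hypothetical layout, and 'MS' (month start) is used here instead of 'M' so that the sketch runs on both older and newer pandas versions.

```python
import pandas as pd

# 'MS' (month start) yields one timestamp per month, both endpoints included
months_demo = pd.date_range('1973-01', '1973-04', freq='MS').strftime('%Y%m')

# Hypothetical directory layout: /data/<yyyymm>/
paths = ['/data/%s/' % m for m in months_demo]
print(paths)
```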

Selecting data by date index

df

1. Selecting directly with a date string

Select a single year:

In [27]: df['1973']

Out[27]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
1973-12-31 09:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 12:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 15:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-12-31 18:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-12-31 21:00:00  01001099999       -3.0        1  ...        1      NaN      NaN

[2801 rows x 19 columns]

Or select a range of years:

In [39]: df['1973':'1974']   # includes 1974
Out[39]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
1974-12-31 09:00:00  01001099999        0.0        1  ...        1      NaN      NaN
1974-12-31 12:00:00  01001099999        0.0        1  ...        1      NaN      NaN
1974-12-31 15:00:00  01001099999       -4.0        1  ...        1      NaN      NaN
1974-12-31 18:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1974-12-31 21:00:00  01001099999       -5.0        1  ...        1      NaN      NaN

[5517 rows x 19 columns]

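Selecting by year is probably the most common case.
If only annual means or similar are needed, groupby (updating) ......
For more complex year-by-year processing, loop over all years and handle each year's data separately:

df_years = df.index.to_period('A').unique().year  # show the index by year, take unique values, then extract the year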
Out[29]: 
Int64Index([1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983,
            1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994,
            1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
            2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
            2017, 2018, 2019],
           dtype='int64')

for yr in df_years:
    df_yr = df.loc[str(yr)]   # the key must be a string
    ...  # process this year's data here
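
For the simpler case where only annual statistics are needed, one option is to group on the year of the index instead of looping. A minimal sketch on synthetic data; the column name TMP_VALUE mirrors the station data above, but the values are made up.

```python
import pandas as pd

idx = pd.to_datetime(['1973-01-01', '1973-07-01', '1974-01-01', '1974-07-01'])
df_demo = pd.DataFrame({'TMP_VALUE': [-3.0, 5.0, -1.0, 7.0]}, index=idx)

# Group rows by calendar year and average each group
annual_mean = df_demo.groupby(df_demo.index.year)['TMP_VALUE'].mean()
print(annual_mean)
```

The result is indexed by the integer year, so annual_mean.loc[1973] returns that year's mean.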

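A quick look at selecting by month, day, and hour:

# select a month
df['1973-01']    # df['197301'] raises an error
# select a day
df['1973-01-01'] # my data include hours and minutes, so this does not raise an error
df['1973-01-01':'1973-01-01'] # select with a range
# select a specific time
df['1973-01-01 06'] # (as far as I know, selection only goes down to the hour)

.loc also works and is safer. pandas 2.0 and later no longer accept a single date string in df[...] on a DataFrame, so there df.loc['1973'] is the form to use.

df.loc['1973':'1974']   # same result as df['1973':'1974'] above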
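
The .loc forms can be checked on a synthetic index. This is a sketch under the assumption of a regular 6-hourly record; df_demo and its column A are made up.

```python
import pandas as pd

# Synthetic 6-hourly index covering 1973 and 1974
idx = pd.date_range('1973-01-01', '1974-12-31 18:00', freq=pd.Timedelta(hours=6))
df_demo = pd.DataFrame({'A': range(len(idx))}, index=idx)

one_year = df_demo.loc['1973']          # all of 1973
one_month = df_demo.loc['1973-01']      # January 1973
one_day = df_demo.loc['1973-01-01']     # the four reports of that day
both = df_demo.loc['1973':'1974']       # both endpoints are included
print(len(one_year), len(one_month), len(one_day), len(both))
```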

2. Filtering the index by condition

I usually filter through the index. A datetime index exposes the year, month, day, and hour:

In [30]: df.index.month

Out[30]: 
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            ...
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
           dtype='int64', length=98404)

Then a boolean mask does the filtering:

In [31]: df[df.index.month==1]
Out[31]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 00:00:00  01001099999       -3.0        1  ...        1      NaN      NaN
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-01 12:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-01 18:00:00  01001099999       -6.0        1  ...        1      NaN      NaN
1973-01-02 00:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
                         ...        ...      ...  ...      ...      ...      ...
2020-01-30 18:00:00  01001099999        1.2        1  ...        1      NaN      NaN
2020-01-31 00:00:00  01001099999        1.2        1  ...        1      NaN      NaN
2020-01-31 09:00:00  01001099999        0.6        1  ...        1      NaN      NaN
2020-01-31 15:00:00  01001099999       -1.0        1  ...        1      NaN      NaN
2020-01-31 18:00:00  01001099999       -1.4        1  ...        1      NaN      NaN

[8786 rows x 19 columns]
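
The same idea extends to seasons and to combined conditions. A minimal sketch on a synthetic 6-hourly year (df_demo is made up):

```python
import pandas as pd

idx = pd.date_range('1973-01-01', '1973-12-31 18:00', freq=pd.Timedelta(hours=6))
df_demo = pd.DataFrame({'A': range(len(idx))}, index=idx)

# Winter (DJF): keep rows whose month is December, January, or February
djf = df_demo[df_demo.index.month.isin([12, 1, 2])]

# Conditions can be combined with & (and) and | (or)
jan_06z = df_demo[(df_demo.index.month == 1) & (df_demo.index.hour == 6)]
print(len(djf), len(jan_06z))
```

Each condition needs its own parentheses because & binds more tightly than ==.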

3. truncate

It drops the rows before or after a given date.

dates = pd.date_range('2016-01-01', '2016-02-01', freq='s')
df = pd.DataFrame(index=dates, data={'A': 1})
# drop the rows before 01-05 and after 01-10
df.truncate(before=pd.Timestamp('2016-01-05'),
            after=pd.Timestamp('2016-01-10'))

# this can also be written more simply as:
df.truncate('2016-01-05', '2016-01-10')

# truncate treats a bare date as midnight, so the last value is 2016-01-10 00:00:00
# compare df.loc['2016-01-05':'2016-01-10', :], where the last value is 2016-01-10 23:59:59

Changing how dates are displayed

1. to_period (updating)

df.index.to_period('A').unique().year
# process day by day
dates = df_ww.index.to_period('d').strftime('%Y-%m-%d').unique()

2. asfreq (updating)

To be honest, I still haven't figured out how to use this function.

3. astype('datetime64[M]')

This converts to np.datetime64 format and returns an array.

df.index.values
Out[43]: 
array(['1978-01-01T00:00:00.000000000', '1978-01-01T06:00:00.000000000',
       '1978-01-01T12:00:00.000000000', ...,
       '2007-02-28T17:00:00.000000000', '2007-02-28T19:00:00.000000000',
       '2007-02-28T20:00:00.000000000'], dtype='datetime64[ns]')

df.index.values.astype('datetime64[M]')
Out[44]: 
array(['1978-01', '1978-01', '1978-01', ..., '2007-02', '2007-02',
       '2007-02'], dtype='datetime64[M]')

Batch processing by year/month (count, sum, mean): resample

df.resample('M') returns a Resampler object; a concrete method such as .count(), .sum(), or .mean() has to be chained after it to get a result.

# monthly number of ww observations
In [36]: num = df['WW_VALUE'].resample('M').count()

Out[36]: 
1973-01-31    217
1973-02-28    210
1973-03-31    242
1973-04-30    237
1973-05-31    243

2020-02-29    159
2020-03-31    180
2020-04-30    179
2020-05-31    186
2020-06-30     89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64

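The returned index is automatically set to the last day of each month. To keep only the "year-month" form, use to_period('m'):

In [37]: num = df['WW_VALUE'].resample('M').count().to_period('m')   # upper- or lower-case "m" makes no difference, but "Y" and "y" behave differently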
Out[37]: 
1973-01    217
1973-02    210
1973-03    242
1973-04    237
1973-05    243

2020-02    159
2020-03    180
2020-04    179
2020-05    186
2020-06     89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64
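
A self-contained sketch of the resample-then-aggregate pattern on synthetic daily data. It uses 'MS' (month start) rather than the month-end 'M' alias above, an assumption made so that it runs on both older and newer pandas versions; the series s is made up.

```python
import pandas as pd

# Synthetic daily series for January and February 1973
idx = pd.date_range('1973-01-01', '1973-02-28', freq='D')
s = pd.Series(1.0, index=idx)

# resample() alone only groups the data; .count() does the aggregation.
# 'MS' labels each bin with the first day of the month.
monthly = s.resample('MS').count().to_period('M')
print(monthly)
```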

Selecting data within a time-of-day window

pandas.DataFrame.between_time

A quick and blunt way to select the data inside a time window:

In [38]: df.between_time('5:00', '7:00')
Out[38]: 
                         STATION  TMP_VALUE TMP_FLAG  ...  AY_FLAG AZ_VALUE  AZ_FLAG
1973-01-01 06:00:00  01001099999       -2.0        1  ...        1      NaN      NaN
1973-01-02 06:00:00  01001099999       -9.0        1  ...        1      NaN      NaN
1973-01-03 06:00:00  01001099999       -5.0        1  ...        1      NaN      NaN
1973-01-05 06:00:00  01001099999      -10.0        1  ...        1      NaN      NaN
1973-01-06 06:00:00  01001099999      -11.0        1  ...        1      NaN      NaN
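
A small sketch showing which rows between_time keeps, using made-up report times around the 05:00 to 07:00 window:

```python
import pandas as pd

idx = pd.to_datetime(['1973-01-01 04:59', '1973-01-01 05:00',
                      '1973-01-01 06:15', '1973-01-01 07:00',
                      '1973-01-01 07:01', '1973-01-02 06:00'])
df_demo = pd.DataFrame({'A': range(len(idx))}, index=idx)

# Keeps every day's rows between 05:00 and 07:00; both end times are included
window = df_demo.between_time('5:00', '7:00')
print(window)
```

The filter looks only at the time of day, so matching rows from every date are returned.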

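Shifting dates and times: Timedelta

A time window can also be selected by adding or subtracting time. A nice thing about pandas is that it works out the resulting date automatically.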

In [40]: pd.Timedelta('6h')
Out[40]: Timedelta('0 days 06:00:00')

In [41]: pd.Timedelta(6,unit='h')
Out[41]: Timedelta('0 days 06:00:00')

In[42]: pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
Out[42]: Timedelta('2 days 02:15:30')

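Units accepted by Timedelta:

  • 'Y', 'M', 'W', 'D', 'H', 'T', 'S', 'L', 'U', or 'N' (I added 'H'; these stand for year, month, week, day, hour, minute, second, millisecond, microsecond, nanosecond). Recent pandas versions reject 'Y' and 'M' for Timedelta because their length is ambiguous.
  • 'days' or 'day'
  • 'hours', 'hour', 'hr', or 'h'
  • 'minutes', 'minute', 'min', or 'm'
  • 'seconds', 'second', or 'sec'
  • ... [see the official documentation for the rest]

Selecting the data within one hour before and after 00:00: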

repo00 = pd.to_datetime(date+' 00')
df_date00 = df[
                  (df.index >= repo00 - pd.Timedelta('1h'))
                & (df.index <= repo00 + pd.Timedelta('1h'))
                  ]

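Adding and subtracting dates

In this project the report times are often not on the hour (for example 5:45 or 6:15), and the number of reports per day varies. To reduce the data to four reports per day, I need the observations within 1 hour of 00, 06, 12, and 18, and from those the one closest to each hour. This depends on time differences, so resample cannot do it.

# the closest observation within 1h of 06Z, 12Z, 18Z
# df_resampled is an existing DataFrame that accumulates the selected rows
df_date = df_ww.loc[date]
obs06 = df_date.between_time('5:00', '7:00').index     # index of the reports around 06
delta06 = abs((obs06 - pd.to_datetime(date+' 06')).values)
if delta06.size > 0:
    df_resampled = pd.concat([df_resampled,
                              df_date.loc[[obs06[np.argmin(delta06)]]]
                             ])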
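
The same idea can be wrapped in a helper and applied to all four synoptic hours. This is only a sketch under assumptions: the function name closest_within, the 1-hour default window, and the sample observations are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up observations at irregular times
times = pd.to_datetime(['1979-01-01 05:40', '1979-01-01 06:15',
                        '1979-01-01 06:50', '1979-01-01 11:58',
                        '1979-01-01 20:30'])
df_obs = pd.DataFrame({'t': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=times)


def closest_within(df, target, window=pd.Timedelta(hours=1)):
    """Return the row nearest to target as a one-row DataFrame,
    or None if no row lies within the window (hypothetical helper)."""
    delta = np.abs((df.index - pd.Timestamp(target)).values)  # timedelta64 array
    mask = delta <= window.to_timedelta64()
    if not mask.any():
        return None
    return df[mask].iloc[[np.argmin(delta[mask])]]


picked = [closest_within(df_obs, '1979-01-01 ' + hh)
          for hh in ('00', '06', '12', '18')]
result = pd.concat([p for p in picked if p is not None])
print(result)
```

With this sample, 06:15 is kept for 06Z and 11:58 for 12Z, while 00Z and 18Z have no report within the window and are skipped.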

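Note that subtracting two datetimes gives a result in ns. To change the unit, use .dt.total_seconds(), or take the numbers out with .values and convert them. (.dt.seconds returns only the seconds component within a day and drops whole days.) Reference: http://blog.gqylpy.com/gqy/22545/#pandas_74

df['diff_time'] = (df['tm_1'] - df['tm_2']).dt.total_seconds()/60        # to minutes
df['diff_time'] = (df['tm_1'] - df['tm_2']).values/np.timedelta64(1, 'h')       # to hours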
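
The difference between the seconds component and the full duration shows up as soon as the gap exceeds one day. A minimal sketch with made-up times:

```python
import numpy as np
import pandas as pd

t1 = pd.Series(pd.to_datetime(['1979-01-02 08:00']))
t2 = pd.Series(pd.to_datetime(['1979-01-01 06:00']))
diff = t1 - t2                                  # 1 day 2 hours

secs_component = diff.dt.seconds                # seconds part only: whole days are dropped
total_minutes = diff.dt.total_seconds() / 60    # full duration in minutes
total_hours = diff / np.timedelta64(1, 'h')     # full duration in hours
print(secs_component.iloc[0], total_minutes.iloc[0], total_hours.iloc[0])
```

Here .dt.seconds reports 7200 (the 2 hours), whereas the full duration is 26 hours.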

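Comparison with the numpy datetime64 format

1.

This came up when collocating two datasets: I needed to merge the meteorological variables of the two datasets at matching times.
One of them is tabular data read into a pd.DataFrame, with dates converted by pd.to_datetime and set as the index:

In [31]: df.index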
Out[31]: 
DatetimeIndex(['1979-01-01 00:00:00', '1979-01-01 06:00:00',
               '1979-01-01 12:00:00', '1979-01-01 18:00:00',
               '1979-01-02 00:00:00', '1979-01-02 06:00:00',
               '1979-01-02 12:00:00', '1979-01-02 18:00:00',
               '1979-01-03 00:00:00', '1979-01-03 06:00:00',
               ...
               '1979-01-29 12:00:00', '1979-01-29 18:00:00',
               '1979-01-30 00:00:00', '1979-01-30 06:00:00',
               '1979-01-30 12:00:00', '1979-01-30 18:00:00',
               '1979-01-31 00:00:00', '1979-01-31 06:00:00',
               '1979-01-31 12:00:00', '1979-01-31 18:00:00'],
              dtype='datetime64[ns]', name='date', length=124, freq=None)

The other is netCDF data read with xarray, whose time coordinate defaults to np.datetime64 format:

In [33]: tk.time
Out[33]: 

array(['1979-01-01T00:00:00.000000000', '1979-01-01T01:00:00.000000000',
       '1979-01-01T02:00:00.000000000', ..., '1979-01-31T21:00:00.000000000',
       '1979-01-31T22:00:00.000000000', '1979-01-31T23:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01 ... 1979-01-31T23:00:00
Attributes:
    long_name:  time

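[I have actually forgotten what the bug was that I struggled with for so long.]
Writing np.argwhere(df.index[0] == tk.time.values) returns an empty result,
so use .values on both sides and compare the raw values: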

In [77]: np.argwhere(df.index.values[0] == tk.time.values)
Out[77]: array([[0]])

# use int() to extract the number:
In [78]: int(np.argwhere(df.index.values[0] == tk.time.values))
Out[78]: 0
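
To match all times at once rather than one element at a time, one possible approach is Index.get_indexer. A sketch with synthetic stand-ins for the two time axes; model_time and station_time are made-up names, not the variables used above.

```python
import pandas as pd

# Synthetic stand-ins: hourly model times (numpy array) and 6-hourly station times
model_time = pd.date_range('1979-01-01', periods=24,
                           freq=pd.Timedelta(hours=1)).values
station_time = pd.date_range('1979-01-01', periods=5,
                             freq=pd.Timedelta(hours=6))

# Position of each station time within model_time; -1 marks "not found"
pos = pd.Index(model_time).get_indexer(station_time)
matched = pos >= 0
print(pos)
```

The last station time falls on the next day, outside the 24 model hours, so its position is -1 and matched is False for it.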

2. Converting between Timestamp, datetime.datetime, and np.datetime64

I don't need this for now. For reference, see https://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64

To convert the result of pd.to_datetime to np.datetime64 format, just wrap it in np.array:

date_n = np.array(pd.to_datetime(df[['year', 'month', 'day', 'hour']]) )
Out[83]: 
array(['1979-01-01T00:00:00.000000000', '1979-01-01T06:00:00.000000000',
       '1979-01-01T12:00:00.000000000', '1979-01-01T18:00:00.000000000',
        ...
       '1979-01-31T00:00:00.000000000', '1979-01-31T06:00:00.000000000',
       '1979-01-31T12:00:00.000000000', '1979-01-31T18:00:00.000000000'],
      dtype='datetime64[ns]')

(1) python datetime => datetime64 / Timestamp

dt = datetime.datetime(year=2017, month=10, day=24, hour=4, 
                   minute=3, second=10, microsecond=7199)
>>> np.datetime64(dt)
numpy.datetime64('2017-10-24T04:03:10.007199')

>>> pd.Timestamp(dt) # or pd.to_datetime(dt)
Timestamp('2017-10-24 04:03:10.007199')

(2) numpy datetime64 => Timestamp

In [86]: pd.Timestamp(np.datetime64('2012-05-01T01:00:00.000000'))
Out[86]: Timestamp('2012-05-01 01:00:00')

In [87]: pd.to_datetime('2012-05-01T01:00:00.000000+0100')
Out[87]: Timestamp('2012-05-01 01:00:00+0100', tz='pytz.FixedOffset(60)')

In [88]: pd.to_datetime('2012-05-01T01:00:00.000000+0100').replace(tzinfo=None)
Out[88]: Timestamp('2012-05-01 01:00:00')

np.datetime64 => datetime seems a bit complicated. I'll look it up later if I need it.
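
One route that avoids handling numpy units by hand is to go through pd.Timestamp, combining steps (2) and (3). A minimal sketch, not necessarily the only or best way:

```python
import datetime
import numpy as np
import pandas as pd

dt64 = np.datetime64('2012-05-01T01:00:00')

# Route through pd.Timestamp, which can export a Python datetime
py_dt = pd.Timestamp(dt64).to_pydatetime()
print(py_dt)
```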

(3) Timestamp => datetime / datetime64

>>> ts = pd.Timestamp('2017-10-24 04:24:33.654321')

>>> ts.to_pydatetime()   # Python's datetime
datetime.datetime(2017, 10, 24, 4, 24, 33, 654321)

>>> ts.to_datetime64()
numpy.datetime64('2017-10-24T04:24:33.654321000')

Emulating MATLAB's tic/toc timing

https://blog.csdn.net/u010199776/article/details/69941965

import datetime

tic = datetime.datetime.now()
...
toc = datetime.datetime.now()
print('%s, Elapsed time: %f seconds' % (toc, (toc-tic).total_seconds() ))

2020-07-19 16:23:13.321484, Elapsed time: 5.371069 seconds
