Parsing date strings with the parse_dates parameter of pd.read_csv, and comparing pd.to_datetime(), dateutil.parser.parse() and datetime.strptime()

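While cleaning the historical Police Department Incident Reports data for San Francisco, I found that records from before May 2018 split the date and time into two columns, Date and Time, whereas records after 2018 have a single Datetime column. The two layouts need to be unified.

Figure: Date and Time are stored as two separate columns in the source data file

The parse_dates parameter of read_csv in detail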

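Because the date and the time sit in two separate columns of the csv file, pd.read_csv has to be called with parse_dates = [['Date', 'Time']], that is, the strings of the two columns are first joined and then parsed. The combined column is named by joining the original column names with an underscore '_', so in this example it is called 'Date_Time'. The parsed datetime column becomes the first column of the DataFrame, so take care when index_col selects the Index by position. Here, index_col = 0 would use the newly generated Date_Time column as the Index rather than IncidntNum. The safer approach is to give the column name, e.g. index_col = 'IncidntNum'.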
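The nested-list form of parse_dates described above is deprecated in recent pandas releases (2.2 and later), so the following is only a minimal sketch of the same idea done explicitly: read the two columns as strings, join them, and parse once. The three sample rows and the 'MM/DD/YYYY' and 'HH:MM' layouts are assumptions made up for illustration, not rows taken from the real file.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the pre-2018 layout (values are invented).
csv_text = """IncidntNum,Category,Date,Time
150060275,NON-CRIMINAL,01/19/2015,14:00
150098210,ROBBERY,02/01/2015,15:45
150098226,ASSAULT,02/01/2015,16:21
"""

# Select the index by name so that column order cannot change the result.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="IncidntNum",
    dtype={"Date": str, "Time": str},
)

# Join the two strings, then parse them once with an explicit format.
df["Date_Time"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%m/%d/%Y %H:%M"
)
df = df.drop(columns=["Date", "Time"])

print(df)
```

Naming the index column keeps IncidntNum as the Index no matter where the combined datetime column ends up.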

Figure: Result of read_csv with parse_dates = [['Date', 'Time']]

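If you write parse_dates = ['Date', 'Time'] instead, pd.read_csv() converts 'Date' and 'Time' to datetimes separately, which also causes a small annoyance. The Time column in this example has the format 'HH:MM', and when parse_dates hands it to dateutil.parser.parse by default, the parser takes it upon itself to prepend the current date to every value in the Time column.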
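The prepended date can be reproduced with dateutil alone. This small sketch parses a made-up 'HH:MM' string and then repeats the call with the parser's default argument, which supplies the missing fields in place of today's date; the 1900-01-01 placeholder is an arbitrary choice for illustration.

```python
from datetime import date, datetime

from dateutil import parser

# A time-only string: the missing year, month and day come from today's date.
parsed = parser.parse("14:30")
print(parsed)
print("today is", date.today())

# Passing `default` replaces today's date with a fixed placeholder instead.
pinned = parser.parse("14:30", default=datetime(1900, 1, 1))
print(pinned)  # 1900-01-01 14:30:00
```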

Figure: Result of read_csv with parse_dates = ['Date', 'Time']

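What is more, specifying parse_dates makes reading the csv file much slower. The file has 2.205 million rows. Without date parsing it takes under 10 seconds to read it into a DataFrame; parsing the dates while reading takes close to 10 minutes. After a careful read of the read_csv documentation, I found that with infer_datetime_format = True pandas infers the format of the date strings, which can speed up parsing by 5-10x (quoted below). Note that in pandas 2.0 and later this parameter is deprecated, because a strict form of this inference is applied by default.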

infer_datetime_format : bool, default False
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

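In practice the speed-up was obvious: the time dropped from more than 9 minutes to 38.4 seconds.
In addition, the keep_date_col parameter specifies whether the original columns are kept after they have been combined and parsed into a date column. The documentation also mentions dayfirst, which decides whether a string such as '10/11/12' means 'dd/mm/yy' or 'mm/dd/yy': the US convention is 'MM/DD/YYYY', while 'DD/MM/YYYY' is more common internationally. If time zones are involved, you also need to set date_parser to pandas.to_datetime() with utc=True; see Parsing a CSV with mixed timezones for details, which are not repeated here.

Figure: infer_datetime_format = True markedly reduces the date-parsing time of read_csv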

Comparing pd.to_datetime(), dateutil.parser.parse() and datetime.strptime()

datetime.strptime()
datetime is the Python standard-library module for handling dates and times. datetime.strptime() converts a string into a datetime object.

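The function requires the format of the date string to be spelled out, using the directives in the table below. If the format does not match, for example when '21/04/2018' is parsed with the format '%d %m %Y', it raises an error (ValueError: time data '21/04/2018' does not match format '%d %m %Y').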
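The mismatch described above is easy to reproduce. The sketch below parses the same string once with a format that mirrors its slashes and once with the space-separated format that fails.

```python
from datetime import datetime

text = "21/04/2018"

# The format must mirror the string, including the "/" separators.
parsed = datetime.strptime(text, "%d/%m/%Y")
print(parsed)  # 2018-04-21 00:00:00

# A format with spaces instead of slashes does not match and raises ValueError.
try:
    datetime.strptime(text, "%d %m %Y")
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)
```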

Directive	Description	Example Output
%a	Weekday as locale’s abbreviated name.	Sun, Mon, …, Sat
%A	Weekday as locale’s full name.	Sunday, Monday, …, Saturday
%w	Weekday as a decimal number, where 0 is Sunday and 6 is Saturday.	0, 1, 2, 3, 4, 5, 6
%d	Day of the month as a zero-padded decimal number.	01, 02, …, 31
%b	Month as locale’s abbreviated name.	Jan, Feb, …, Dec
%B	Month as locale’s full name.	January, February, …, December
%m	Month as a zero-padded decimal number.	01, 02, …, 12
%y	Year without century as a zero-padded decimal number.	00, 01, …, 99
%Y	Year with century as a decimal number.	0001, 0002, …, 9999
%H	Hour (24-hour clock) as a zero-padded decimal number.	00, 01, …, 23
%I	Hour (12-hour clock) as a zero-padded decimal number.	01, 02, … , 12
%p	Locale’s equivalent of either AM or PM.	AM, PM
%M	Minute as a zero-padded decimal number.	00, 01, …, 59
%S	Second as a zero-padded decimal number.	00, 01, …, 59
%f	Microsecond as a decimal number, zero-padded on the left.	000000, 000001, …, 999999
%z	UTC offset in the form ±HHMM[SS] (empty string if the object is naive).	(empty), +0000, -0400, +1030
%Z	Time zone name (empty string if the object is naive).	(empty), UTC, IST, CST
%j	Day of the year as a zero-padded decimal number.	001, 002, …, 366
%U	Week number of the year (Sunday as the first day of the week) as a zero-padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.	00, 01, …, 53
%W	Week number of the year (Monday as the first day of the week) as a zero-padded decimal number. All days in a new year preceding the first Monday are considered to be in week 0.	00, 01, …, 53
%c	Locale’s appropriate date and time representation.	Tue Aug 16 21:30:00 1988
%x	Locale’s appropriate date representation.	08/16/1988
%X	Locale’s appropriate time representation.	21:30:00
%%	A literal ‘%’ character.	%

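datetime.strptime() is strict and has many directives. Parts of the date and time that the format does not specify are filled in automatically with January 1 (of the year 1900) at 00:00. One fun finding: with a two-digit year '%y', any number greater than 68 is treated as a year in '19xx'. Why is 68 the cutoff? Python follows the POSIX convention here: values 69-99 map to 1969-1999 and values 0-68 map to 2000-2068.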
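Both behaviours, the filled-in defaults and the two-digit-year cutoff, can be checked directly with the standard library. The input strings are arbitrary examples.

```python
from datetime import datetime

# Only a time is given: the date part falls back to 1900-01-01.
time_only = datetime.strptime("14:30", "%H:%M")
print(time_only)  # 1900-01-01 14:30:00

# Only a two-digit year is given: month, day and time fall back to Jan 1, 00:00.
year_68 = datetime.strptime("68", "%y")
year_69 = datetime.strptime("69", "%y")
print(year_68.year)  # 2068
print(year_69.year)  # 1969
```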
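dateutil.parser.parse()
parse comes from the third-party package dateutil and is also the default date parser of read_csv. In "Python for Data Analysis, 2nd Edition", Wes McKinney says that parse can handle almost any date representation a human can understand. parse is very flexible: the string format does not have to be specified, and (with fuzzy=True) it can even pull the date and time out of a piece of text such as 'Today is January 1, 2047 at 8:21:00AM'. When the string contains only a 2-digit or 4-digit year, parse() fills in the missing parts from the current date, unlike datetime.strptime(). In addition, parse('') raises an error.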
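A short sketch of the three behaviours just described. The sentence is the one quoted above; pulling a date out of free text needs fuzzy=True, and the year-only string '2018' is an arbitrary example. The month and day printed for it depend on the day the code is run.

```python
from dateutil import parser

# Free text: fuzzy=True lets the parser skip the words it does not understand.
sentence = "Today is January 1, 2047 at 8:21:00AM"
from_text = parser.parse(sentence, fuzzy=True)
print(from_text)  # 2047-01-01 08:21:00

# Year only: month and day are taken from the current date.
year_only = parser.parse("2018")
print(year_only)

# Empty string: dateutil raises an error (a ValueError or a subclass of it).
empty_failed = False
try:
    parser.parse("")
except ValueError as exc:
    empty_failed = True
    print(type(exc).__name__, exc)
```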

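Figure: Results of dateutil.parser.parse()

pd.to_datetime()
pd.to_datetime() can also parse many date formats automatically, but it is more cautious than parse(); put differently, its flexibility lies between dateutil.parser.parse() and datetime.strptime(). Besides the usual 'dayfirst' and 'yearfirst' options, which settle the order of year, month and day in strings like '10/11/12', pd.to_datetime() accepts a 'format' argument that fixes one specific date format. The 'exact' argument controls whether the string must match 'format' exactly (exact=True, the default) or whether the format may match anywhere in the string (exact=False).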
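The effect of dayfirst, yearfirst and format on an ambiguous string can be sketched as follows. The string '10/11/12' is the one used in the text; the second input is an invented example. Depending on the pandas version, the calls without format may print a warning about format inference.

```python
import pandas as pd

ambiguous = "10/11/12"

# Default reading is month first: October 11, 2012.
month_first = pd.to_datetime(ambiguous)

# dayfirst=True reads it as day/month/year: November 10, 2012.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# yearfirst=True reads it as year/month/day: November 12, 2010.
year_first = pd.to_datetime(ambiguous, yearfirst=True)

# An explicit format removes the ambiguity altogether.
with_format = pd.to_datetime("21/04/2018 14:30", format="%d/%m/%Y %H:%M")

print(month_first, day_first, year_first, with_format, sep="\n")
```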

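Figure: Results of pd.to_datetime

Speed comparison
Using the Date and Time columns of the csv file again, I first joined the strings and then parsed them with datetime.strptime(), dateutil.parser.parse() and pd.to_datetime() in turn; the results are shown in the figure. The fastest, datetime.strptime(), needed only 1 min 14 s to parse the 2.205 million rows; dateutil.parser.parse() took 3 min 56 s, and pd.to_datetime() took as long as 18 min 34 s. Specifying the 'format' argument improved pd.to_datetime() somewhat, but it still needed 10 min 36 s. pd.to_datetime() shares the 'infer_datetime_format' parameter with read_csv(), but setting infer_datetime_format = True here did not seem to affect the running time. Rerunning at another time gives different timings, but the ranking of the three stays the same. Seen this way, the most efficient approach is actually to parse the dates in read_csv() itself.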
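A rough way to repeat this kind of comparison on a small scale is sketched below. It is not the benchmark behind the numbers above: the 5,000 strings are synthetic, the timed helper is a hypothetical name, and pd.to_datetime is called once on the whole Series rather than once per string, which is a different measurement. Timings will vary with the machine and the library versions.

```python
import time
from datetime import datetime, timedelta

import pandas as pd
from dateutil import parser

FMT = "%m/%d/%Y %H:%M"

# Synthetic 'MM/DD/YYYY HH:MM' strings, 17 minutes apart (made-up data).
start = datetime(2015, 1, 1)
strings = [(start + timedelta(minutes=17 * i)).strftime(FMT) for i in range(5000)]


def timed(label, func):
    """Run func once, print the elapsed wall-clock time, return its result."""
    t0 = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - t0
    print(f"{label:<30}{elapsed:.4f} s")
    return result


via_strptime = timed(
    "datetime.strptime", lambda: [datetime.strptime(s, FMT) for s in strings]
)
via_dateutil = timed(
    "dateutil.parser.parse", lambda: [parser.parse(s) for s in strings]
)
via_pandas = timed(
    "pd.to_datetime(format=...)",
    lambda: pd.to_datetime(pd.Series(strings), format=FMT),
)
```

All three calls should return the same datetimes, so any difference is purely in running time.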

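Figure: Speed comparison of datetime.strptime(), dateutil.parser.parse() and pd.to_datetime()

Conclusion

When reading a file with pd.read_csv and parsing dates through parse_dates, setting infer_datetime_format = True is probably a good option.
datetime.strptime() runs faster but requires the date format to be written out. dateutil.parser.parse() detects the format automatically and is more flexible. For a string such as '20', parse() returned '2020-xx-xx 00:00' in my test (xx-xx being the current date), which can cause unnecessary trouble.
pd.to_datetime() can also detect the date format automatically, but it is more cautious than dateutil.parser.parse(). For ambiguous strings such as '20', pd.to_datetime() raises an error. However, it was also the slowest of the three.
All things considered, when strings need to be converted to dates, dateutil.parser.parse() is worth choosing first. No wonder pd.read_csv() uses it as its default date parser.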

References

pandas.read_csv
Parsing a CSV with mixed timezones
Python for Data Analysis, 2nd Edition
parser — dateutil 2.8.0 documentation
pandas.to_datetime — pandas 0.25.1 documentation
